AniClipart: Clipart Animation with Text-to-Video Priors (2404.12347v2)
Abstract: Clipart, a pre-made art form, offers a convenient and efficient way of creating visual content. However, traditional workflows for animating static clipart are laborious and time-consuming, involving steps like rigging, keyframing, and inbetweening. Recent advancements in text-to-video generation hold great potential in resolving this challenge. Nevertheless, direct application of text-to-video models often struggles to preserve the visual identity of clipart or generate cartoon-style motion, resulting in subpar animation outcomes. In this paper, we introduce AniClipart, a computational system that converts static clipart into high-quality animations guided by text-to-video priors. To generate natural, smooth, and coherent motion, we first parameterize the motion trajectories of the keypoints defined over the initial clipart image by cubic Bézier curves. We then align these motion trajectories with a given text prompt by optimizing a video Score Distillation Sampling (SDS) loss and a skeleton fidelity loss. By incorporating differentiable As-Rigid-As-Possible (ARAP) shape deformation and differentiable rendering, AniClipart can be end-to-end optimized while maintaining deformation rigidity. Extensive experimental results show that the proposed AniClipart consistently outperforms the competing methods, in terms of text-video alignment, visual identity preservation, and temporal consistency. Additionally, we showcase the versatility of AniClipart by adapting it to generate layered animations, which allow for topological changes.
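To make the trajectory parameterization concrete, the following is a minimal sketch of evaluating a cubic Bézier curve for a single keypoint and sampling it once per frame. It is not the authors' implementation: the function names `cubic_bezier` and `sample_trajectory`, the control-point values, and the frame count are all hypothetical, and the sketch assumes the first control point coincides with the keypoint's position in the initial clipart. In the system described above the control points would be optimized against the SDS and skeleton losses; here they are fixed by hand.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Cubic Bernstein basis weights; they sum to 1 for any t.
    b0 = u * u * u
    b1 = 3.0 * u * u * t
    b2 = 3.0 * u * t * t
    b3 = t * t * t
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_trajectory(control_points: Sequence[Point], num_frames: int) -> List[Point]:
    """Sample one keypoint position per frame along the curve.

    control_points: four (x, y) pairs; the first is assumed (for this sketch)
    to be the keypoint's position in the initial clipart.
    """
    if len(control_points) != 4:
        raise ValueError("a cubic Bezier curve needs exactly four control points")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    p0, p1, p2, p3 = control_points
    # Uniform parameter spacing: frame 0 maps to t = 0, the last frame to t = 1.
    return [
        cubic_bezier(p0, p1, p2, p3, i / (num_frames - 1))
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Illustrative control points only; in practice these would be learned.
    controls = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
    for frame, (x, y) in enumerate(sample_trajectory(controls, 5)):
        print(f"frame {frame}: ({x:.3f}, {y:.3f})")
```

Because the curve is a smooth polynomial in `t`, consecutive frames move the keypoint along a continuous path, which is one reason a Bézier parameterization suits the smooth motion the abstract aims for.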
Authors: Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao