AniClipart: Clipart Animation with Text-to-Video Priors (2404.12347v2)

Published 18 Apr 2024 in cs.CV and cs.GR

Abstract: Clipart, a pre-made art form, offers a convenient and efficient way of creating visual content. However, traditional workflows for animating static clipart are laborious and time-consuming, involving steps like rigging, keyframing, and inbetweening. Recent advancements in text-to-video generation hold great potential in resolving this challenge. Nevertheless, direct application of text-to-video models often struggles to preserve the visual identity of clipart or generate cartoon-style motion, resulting in subpar animation outcomes. In this paper, we introduce AniClipart, a computational system that converts static clipart into high-quality animations guided by text-to-video priors. To generate natural, smooth, and coherent motion, we first parameterize the motion trajectories of the keypoints defined over the initial clipart image by cubic Bézier curves. We then align these motion trajectories with a given text prompt by optimizing a video Score Distillation Sampling (SDS) loss and a skeleton fidelity loss. By incorporating differentiable As-Rigid-As-Possible (ARAP) shape deformation and differentiable rendering, AniClipart can be end-to-end optimized while maintaining deformation rigidity. Extensive experimental results show that the proposed AniClipart consistently outperforms the competing methods, in terms of text-video alignment, visual identity preservation, and temporal consistency. Additionally, we showcase the versatility of AniClipart by adapting it to generate layered animations, which allow for topological changes.

Authors (4)
  1. Ronghuan Wu (4 papers)
  2. Wanchao Su (9 papers)
  3. Kede Ma (57 papers)
  4. Jing Liao (100 papers)
Citations (2)

Summary

AniClipart: Enhancing Clipart Animation with Text-to-Video Priors

Introduction

AniClipart introduces a novel approach to animating static clipart images, using text-to-video (T2V) priors to guide motion trajectories. The research leverages recent advances in text-to-video diffusion models to simplify the animation process while preserving the artistic identity of the clipart. Motion is defined by cubic Bézier curves attached to keypoints on the clipart, optimized through a video Score Distillation Sampling (VSDS) loss. This enables animations that are smooth and visually coherent while remaining faithful to the clipart's original style.
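To make the trajectory parameterization concrete, below is a minimal sketch of evaluating a cubic Bézier displacement curve for one keypoint. The control-point values and the number of frames are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    return (u**3) * p0 + 3 * (u**2) * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

# Hypothetical setup: one curve per keypoint; the four control points are
# the optimization variables, and the curve is sampled once per frame.
num_frames = 16
ctrl = np.array([[0.0, 0.0], [0.1, 0.3], [0.3, 0.3], [0.4, 0.0]])  # assumed values
trajectory = np.stack(
    [cubic_bezier(*ctrl, t) for t in np.linspace(0.0, 1.0, num_frames)]
)
print(trajectory.shape)  # (16, 2): one 2D keypoint displacement per frame
```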

Methodology

AniClipart employs several innovative steps to achieve its objectives:

  • Keypoint and Skeleton Detection: Detects keypoints on the clipart and connects them into a skeletal framework that guides the subsequent deformation.
  • Bézier-driven Animation: The motion trajectory of each keypoint is represented as a cubic Bézier curve, enabling controlled and smooth animation.
  • Loss Functions: A video Score Distillation Sampling (VSDS) loss aligns the resulting motion with the text prompt, while a skeleton preservation loss maintains the structural integrity of the character throughout the animation (a sketch of the VSDS loss follows this list).
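
The sketch below shows one score-distillation step in the standard form used by SDS-based methods. Here `diffusion.add_noise`, `diffusion.predict_noise`, and `diffusion.sds_weight` are hypothetical wrappers around a pretrained, frozen T2V diffusion model, not a real library API:

```python
import torch

def video_sds_loss(frames, text_emb, diffusion, t):
    """One score-distillation step: noise the rendered video, ask the frozen
    T2V model to predict that noise given the text embedding, and turn the
    residual into a gradient on the frames."""
    noise = torch.randn_like(frames)
    noisy = diffusion.add_noise(frames, noise, t)           # forward diffusion
    eps_pred = diffusion.predict_noise(noisy, t, text_emb)  # frozen denoiser
    grad = diffusion.sds_weight(t) * (eps_pred - noise)     # SDS gradient
    # Detach grad so backprop flows only through the differentiable renderer
    # into the Bezier control points: d(loss)/d(frames) == grad.
    return (grad.detach() * frames).sum()
```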

Key innovations include the use of As-Rigid-As-Possible (ARAP) shape manipulation to maintain the rigidity and visual identity of the clipart during animation. Because every stage, from shape deformation to rendering, is differentiable, the system can be optimized end to end so that the animation dynamics follow the textual description.
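As a rough illustration of the rigidity term, here is a minimal ARAP energy for a 2D mesh in the standard as-rigid-as-possible formulation; the one-ring neighborhoods and uniform edge weights are simplifying assumptions rather than the paper's exact setup:

```python
import numpy as np

def best_rotation_2d(P, Q):
    """Best rotation R with R @ p ~ q for paired edge vectors (rows of P, Q),
    via the orthogonal Procrustes solution."""
    U, _, Vt = np.linalg.svd(Q.T @ P)
    R = U @ Vt
    if np.linalg.det(R) < 0:   # keep a proper rotation, not a reflection
        U[:, -1] *= -1
        R = U @ Vt
    return R

def arap_energy(rest, deformed, neighbors):
    """Sum of per-vertex rigidity residuals over one-ring edges.
    rest, deformed: (n, 2) vertex arrays; neighbors: {vertex: [indices]}."""
    energy = 0.0
    for i, nbrs in neighbors.items():
        P = rest[nbrs] - rest[i]          # rest-pose edge vectors
        Q = deformed[nbrs] - deformed[i]  # deformed edge vectors
        R = best_rotation_2d(P, Q)
        energy += np.sum((Q - P @ R.T) ** 2)
    return energy
```

Minimizing this energy penalizes local stretching and shearing, so the clipart bends at its joints instead of distorting.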

Experimental Setup and Results

Extensive experiments demonstrate that AniClipart outperforms existing image-to-video models in various aspects:

  • Text-Video Alignment: Ensures that the generated animations are aligned with the text prompts, reflecting the intended motions accurately.
  • Visual Identity Preservation: Successfully retains the original aesthetic and structural details of the clipart, a notable improvement over methods that tend to distort the source image during animation.

The system was tested across multiple clipart categories, including humans, animals, and objects, showing its versatility and robustness. Comparison with conventional methods highlights AniClipart's enhanced capability to preserve visual identity and produce semantically meaningful animations.
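
As an illustration of how text-video alignment might be quantified, the sketch below averages frame-prompt CLIP similarity using Hugging Face `transformers`. This mirrors common practice for such metrics and is not necessarily the paper's exact evaluation protocol:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def text_video_alignment(frames, prompt,
                         model_name="openai/clip-vit-base-patch32"):
    """Mean CLIP cosine similarity between a prompt and a list of PIL frames.
    An illustrative metric, not necessarily the paper's protocol."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # average similarity over frames
```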

Implications and Future Work

The development of AniClipart has both practical and theoretical implications for the field of automatic animation:

  • Reduction in Manual Effort: By automating key aspects of the animation process, AniClipart significantly reduces the time and effort traditionally required to animate clipart.
  • Broadened Applicability: The method's success with diverse clipart suggests potential applications in other forms of graphic animations, such as educational tools, presentations, and entertainment media.

Looking ahead, potential enhancements could include adapting the system for 3D animation, improving the model's ability to handle complex motion patterns, and refining the text-to-motion alignment to capture nuanced textual descriptions more effectively.

Conclusions

AniClipart represents a significant step forward in the automation of clipart animation, driven by cutting-edge AI techniques. By bridging text-to-video models with clipart animation, this research not only simplifies the animation process but also enhances the creative possibilities, making high-quality animation more accessible. Future developments in this area are poised to further revolutionize how graphical content is animated and used across various digital platforms.
