The paper introduces a novel approach leveraging video diffusion models to generate animated stickers from static images conditioned on text prompts.
It discusses the development of a two-stage finetuning pipeline with a human-in-the-loop strategy to enhance motion quality and achieve stylistic fidelity.
Key innovations include the use of an ensemble-of-teachers strategy, middle-frame conditioning, and motion bucketing for training improvements.
The research aims to bridge the domain gap in animating stickers and to optimize inference for practical deployment, and it proposes directions for future enhancements.
The domain of video diffusion models has witnessed considerable advancements, but their application to short, stylized video generation—such as animating stickers from text and images—has remained relatively unexplored. To address this, the authors introduce a novel approach that employs a video diffusion model to generate animated stickers from static images conditioned on text prompts. The approach builds on the Emu text-to-image model, augmenting it with temporal layers to capture motion and thereby turning static stickers into dynamic, expressive animations suited to use on social media.
The central challenge faced in this domain is the significant gap between the visual and motion characteristics of natural videos and those expected in sticker-style animations. The paper outlines the development of a two-stage finetuning pipeline that effectively bridges this gap. The initial stage involves finetuning with weakly in-domain data, while the subsequent stage utilizes a human-in-the-loop strategy with an ensemble of teachers to distill desirable motion qualities into a more compact student model. This paper details the methodology behind this approach, the architecture of the model, and the finetuning processes employed to ensure high-quality motion and stylistic fidelity in the generated animations.
The paper makes several contributions, presenting a new application of video diffusion models while also introducing techniques of broader use in generative AI. Specifically:
The development of an end-to-end process for creating animated stickers from static images and text prompts, leveraging the strengths of video diffusion models.
An ensemble-of-teachers strategy for human-in-the-loop finetuning that significantly enhances motion quality and fidelity to the sticker's original style.
The introduction of video-specific training improvements, such as middle-frame conditioning and motion bucketing, which collectively elevate the model's output quality.
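The motion-bucketing idea above can be illustrated with a minimal sketch. The assumption here (all function names are hypothetical, and the scoring heuristic is a generic stand-in, not the paper's exact formulation) is that each training clip is assigned a discrete motion-magnitude bucket, which the model can then be conditioned on to control motion strength at inference time:

```python
import numpy as np

def motion_score(frames: np.ndarray) -> float:
    """Mean absolute difference between consecutive frames.

    frames: (T, H, W) array of grayscale pixel values; larger
    score means more frame-to-frame motion.
    """
    return float(np.abs(np.diff(frames, axis=0)).mean())

def bucket_edges(scores, n_buckets=4):
    """Quantile-based edges so each bucket holds roughly equal data."""
    qs = np.linspace(0, 1, n_buckets + 1)[1:-1]
    return np.quantile(scores, qs)

def to_bucket(score, edges):
    """Map a motion score to a discrete bucket id in [0, n_buckets)."""
    return int(np.digitize(score, edges))

# Toy dataset: four clips with increasing amounts of motion.
rng = np.random.default_rng(0)
clips = [rng.normal(0, s, size=(8, 16, 16)) for s in (0.1, 0.5, 1.0, 2.0)]
scores = [motion_score(c) for c in clips]
edges = bucket_edges(scores, n_buckets=4)
buckets = [to_bucket(s, edges) for s in scores]
print(buckets)  # clips with more motion land in higher buckets
```

Quantile edges keep the buckets balanced regardless of how skewed the motion-score distribution is across the dataset.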
The model's architecture is grounded in latent diffusion models (LDM), with crucial modifications to incorporate temporal dynamics and dual conditioning on both text and images. This setup ensures that the generated animations not only exhibit high-quality motion but also remain true to the content and style dictated by the input image and accompanying text prompt.
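A common way to graft temporal layers onto a pretrained image-diffusion backbone (a hedged sketch of the general pattern, not the paper's exact implementation) is to reshape the video tensor so that frozen spatial layers see independent frames while the new temporal layers see per-location sequences across time:

```python
import numpy as np

def spatial_view(x: np.ndarray) -> np.ndarray:
    """Fold time into the batch so pretrained image layers
    process each frame independently: (B, T, C, H, W) -> (B*T, C, H, W)."""
    B, T, C, H, W = x.shape
    return x.reshape(B * T, C, H, W)

def temporal_view(x: np.ndarray, B: int, T: int) -> np.ndarray:
    """Fold space into the batch so temporal layers attend over
    the T frames at each spatial location: (B*T, C, H, W) -> (B*H*W, T, C)."""
    BT, C, H, W = x.shape
    return (x.reshape(B, T, C, H, W)
             .transpose(0, 3, 4, 1, 2)   # (B, H, W, T, C)
             .reshape(B * H * W, T, C))

B, T, C, H, W = 2, 8, 4, 16, 16
x = np.zeros((B, T, C, H, W))
s = spatial_view(x)       # what a spatial attention/conv layer sees
t = temporal_view(s, B, T)  # what a temporal attention layer sees
print(s.shape, t.shape)   # (16, 4, 16, 16) (512, 8, 4)
```

Because the spatial layers are untouched by the reshape, the pretrained image weights can be reused as-is while only the temporal layers learn motion.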
A significant portion of the research focused on overcoming this domain gap. Conventional image-to-video (I2V) models, when applied directly to animating stickers, produce subpar animations characterized by trivial motion patterns or inconsistencies. To address this, the study deploys a two-phase finetuning mechanism. In the first phase, the model is finetuned on a curated dataset of animations that, while not perfectly aligned with the target domain, are closer in visual style and motion to stickers than generic video data. In the second phase, a set of teacher models generates a large pool of candidate animations, which human annotators review; the selected high-quality outputs form a specialized dataset for training the final student model, tailoring it to produce animated stickers with the desired motion characteristics.
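The human-in-the-loop curation step can be sketched as a simple generate-rate-filter loop. Everything here is a stand-in (teacher names, the `generate` helper, and the acceptance threshold are hypothetical; in the real pipeline the quality signal comes from human annotators, not a random number):

```python
import random

random.seed(0)

def generate(teacher: str, prompt: str):
    """Stand-in for sampling one animation from a teacher model.
    Returns (animation_id, quality). The random quality score here
    is a placeholder for a human annotator's judgment."""
    return f"{teacher}:{prompt}", random.random()

teachers = ["teacher_a", "teacher_b", "teacher_c"]  # differently finetuned experts
prompts = [f"prompt_{i}" for i in range(100)]
THRESHOLD = 0.8  # hypothetical acceptance bar

# Sample from every teacher, keep only approved clips for the student.
finetune_set = []
for prompt in prompts:
    for teacher in teachers:
        anim, score = generate(teacher, prompt)
        if score >= THRESHOLD:
            finetune_set.append((anim, prompt))

print(len(finetune_set), "clips selected for student finetuning")
```

The key property is that the student never trains on raw teacher output: only the annotator-approved subset survives, so the student inherits the best motion from the whole ensemble.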
Recognizing the importance of efficiency for deployment in a production environment, the paper discusses several inference optimization techniques. These include architectural modifications, precision adjustments, and the innovative use of distillation methods to enhance the speed of video generation without a noticeable compromise on output quality.
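As one concrete example of the kind of speedup involved (a generic sketch of few-step sampling, not the paper's exact recipe), step-reduced or distilled samplers evaluate the denoiser on a small subsampled timestep schedule rather than on every training timestep:

```python
import numpy as np

def sampling_schedule(train_steps: int, sample_steps: int) -> list:
    """Evenly subsample the training timesteps, highest noise first."""
    ts = np.linspace(0, train_steps - 1, sample_steps).round().astype(int)
    return ts[::-1].tolist()

schedule = sampling_schedule(train_steps=1000, sample_steps=8)
print(schedule)  # [999, 856, 714, 571, 428, 285, 143, 0]
# The denoiser runs len(schedule) times instead of 1000 --
# a ~125x reduction in U-Net evaluations per generated video.
```

Distillation then trains the model to remain accurate under such aggressive step counts, which is why quality need not degrade noticeably.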
The paper concludes with a reflection on potential future improvements, such as extending the animation length, automating the creation of smoothly looping animations, and further refinement of the model to enhance motion quality. The research presented marks a significant step forward in the application of generative AI for social media content creation, offering a scalable method to produce engaging, dynamic stickers that can enrich online communication.
In summary, this work not only showcases the feasibility of using advanced video diffusion models for animating stickers but also lays down a comprehensive framework for tackling the inherent challenges in adapting these models to specific, stylized applications. It represents a pivotal contribution to the field of generative AI, opening up new avenues for research and practical applications in digital media.