- The paper introduces a method using a unidirectional LSTM network to convert streaming audio into discrete viseme sequences for live animation.
- It synchronizes in real time with less than 200ms of latency, using MFCC and log energy features together with a novel data augmentation scheme built on TIMIT recordings.
- The system outperforms commercial tools in live tests while requiring only 13-20 minutes of curated training data for competitive lip sync quality.
Real-Time Lip Sync for Live 2D Animation: A Summary
The paper "Real-Time Lip Sync for Live 2D Animation" presents a significant advancement in the domain of performance-based animation by introducing a method for generating live lip sync for 2D animated characters using a deep learning approach with an LSTM model. This paper addresses the demand for a fast and reliable lip sync system that aligns the mouth movements of cartoon characters with real-time streamed audio, a feature necessary for live broadcasts and interactive media.
Two key properties of the proposed system are its ability to operate with less than 200ms of latency, which keeps interactions timely and believable, and its need for only a modest amount of hand-animated training data. The authors achieve this by tuning an LSTM-based architecture, optimizing the input feature representation, and applying a novel data augmentation strategy.
Methodology
The authors use a unidirectional, single-layer LSTM network that converts streaming audio into a discrete viseme sequence at 24 fps. The input is a compact feature representation consisting of MFCCs, log energy, and their temporal derivatives. Notably, a temporal shift that gives the model a small amount of lookahead makes the detection of viseme transitions more robust, at the cost of a few frames of added latency.
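To make the pipeline concrete, here is a minimal sketch of such a streaming viseme classifier in PyTorch. The feature dimensionality (13 MFCCs plus log energy and their first derivatives), hidden size, and number of viseme classes are illustrative assumptions rather than values taken from the paper; only the unidirectional single-layer LSTM with a per-frame viseme output reflects the described design.

```python
import torch
import torch.nn as nn

class VisemeLSTM(nn.Module):
    """Minimal sketch of a streaming viseme classifier (not the authors' code).

    Assumed dimensions: 28 input features (13 MFCCs + log energy and their
    first derivatives), 128 hidden units, 12 viseme classes.
    """
    def __init__(self, feat_dim=28, hidden_dim=128, n_visemes=12):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_visemes)

    def forward(self, feats, state=None):
        # feats: (batch, frames, feat_dim), one feature frame per 24 fps animation frame
        out, state = self.lstm(feats, state)   # unidirectional: no future context inside the LSTM
        logits = self.head(out)                # per-frame viseme scores
        return logits, state                   # carry state across streamed audio chunks
```

In a streaming setting, the hidden state is carried between audio chunks, and a small fixed lookahead can be implemented by delaying the emitted viseme by a few frames relative to the incoming audio, which contributes to the overall latency budget.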
A distinctive aspect of the paper is a novel data augmentation approach based on dynamic time warping. By leveraging TIMIT corpus recordings, multiple speakers' renditions of the same sentences are aligned to a single hand-animated reference sequence, substantially enlarging the training set while preserving the artist's style.
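A rough sketch of how such a transfer could look is given below, using librosa for MFCC extraction and DTW. The function name and frame bookkeeping are hypothetical, not the authors' implementation.

```python
import librosa

def transfer_visemes(ref_audio, new_audio, ref_visemes, sr=16000, fps=24):
    """Hypothetical helper (not the authors' code): retarget artist-authored
    viseme labels from a reference recording to another speaker's recording
    of the same sentence via dynamic time warping on MFCC features.

    ref_visemes: one viseme label per 24 fps frame of the reference audio.
    """
    hop = sr // fps  # one feature frame per animation frame
    X = librosa.feature.mfcc(y=ref_audio, sr=sr, n_mfcc=13, hop_length=hop)
    Y = librosa.feature.mfcc(y=new_audio, sr=sr, n_mfcc=13, hop_length=hop)
    _, wp = librosa.sequence.dtw(X=X, Y=Y)     # warping path, returned end-to-start
    new_visemes = [None] * Y.shape[1]
    for i, j in wp[::-1]:                      # walk the path from start to end
        i = min(i, len(ref_visemes) - 1)       # guard against off-by-one frame counts
        new_visemes[j] = ref_visemes[i]        # copy the aligned reference label
    return new_visemes
```

Each aligned TIMIT recording then yields an additional training pair that reuses the same artist-created viseme track, which is how a small amount of hand animation can be stretched across many speakers.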
Experimental Results
The paper provides a comprehensive evaluation through human judgment experiments, comparing the proposed method against both live and offline commercial systems like Adobe Character Animator and ToonBoom. The results consistently favor the authors' system across various test scenarios, demonstrating superior accuracy and reliability in live settings.
Training efficiency is another notable result: with only 13-20 minutes of curated, hand-animated lip sync data, the model produces competitive lip sync quality, and the data augmentation technique improves it further.
Implications and Future Directions
This contribution has practical implications for live animation workflows, facilitating more natural and engaging character performances in real-time settings. On a theoretical level, the paper demonstrates the potential of LSTM networks to model complex temporal dependencies in artistic contexts such as animation.
Looking forward, there are several areas ripe for exploration. These include enhancing the robustness of the model to diverse audio inputs, such as background noise or speech variations, and exploring fine-tuning techniques for specific animation styles. Moreover, developing a perceptually-driven loss function may refine the system further by prioritizing more impactful discrepancies in visual lip sync quality.
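As one illustration of what a perceptually-driven objective might look like (speculative, and not part of the paper), a confusion-weighted loss could penalize visually jarring viseme substitutions more heavily than near-equivalent ones:

```python
import torch
import torch.nn.functional as F

def perceptual_viseme_loss(logits, targets, confusion_cost):
    """Speculative sketch: weight errors by how visually disruptive they are.

    confusion_cost[t, p] is an assumed perceptual cost of displaying viseme p
    when viseme t is correct (zero on the diagonal).
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (frames, n_visemes)
    probs = log_probs.exp()
    # expected perceptual cost of the predicted distribution at each frame
    expected_cost = (probs * confusion_cost[targets]).sum(dim=-1)
    nll = F.nll_loss(log_probs, targets, reduction="none")
    return (nll + expected_cost).mean()
```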
In conclusion, the research provides a robust framework for live 2D lip sync, making it a practical tool for current live animation systems and a foundation for future applications that combine machine learning with artistic workflows.