
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers (1911.11502v1)

Published 26 Nov 2019 in cs.CV, cs.LG, and eess.AS

Abstract: Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading, unfortunately, remains inferior to the one of its counterpart speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from the lip movement videos. In this paper, we propose a new method, termed as Lip by Speech (LIBS), of which the goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are formidable to be obtained from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.

Authors (6)
  1. Ya Zhao (6 papers)
  2. Rui Xu (199 papers)
  3. Xinchao Wang (203 papers)
  4. Peng Hou (6 papers)
  5. Haihong Tang (14 papers)
  6. Mingli Song (163 papers)
Citations (80)

Summary

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

This essay provides an overview of the paper "Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers" authored by Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, and Mingli Song. The premise of the research centers on enhancing the performance of lip reading systems, which are often overshadowed by their audio-based counterparts due to the inherent ambiguity associated with lip movements. The paper introduces a novel method denoted as Lip by Speech (LIBS), which seeks to improve lip reading accuracy through the distillation of knowledge from advanced speech recognizers.

Lip reading, or visual speech recognition, is the task of decoding spoken text from visual recordings of lip movements. Despite advances driven by deep learning and large-scale datasets, the task remains challenging because lip articulation is subtle and discriminative features are hard to extract. The paper proposes using a pre-trained speech recognizer to strengthen lip readers by distilling multi-granularity knowledge, spanning sequence-level, context-level, and frame-level features.
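
To make the idea of multi-granularity distillation concrete, the sketch below illustrates how sequence-, context-, and frame-level teacher features might each contribute a distillation term. The tensor shapes and the plain L2 objective are illustrative assumptions, not the authors' exact losses.

```python
# Minimal sketch (not the authors' code): three distillation terms that
# penalize the distance between lip-reader (student) features and
# speech-recognizer (teacher) features at different granularities.
import torch
import torch.nn.functional as F

def multi_granularity_distill_loss(
    video_seq_emb,    # (B, D)     sequence-level embedding from the lip reader
    audio_seq_emb,    # (B, D)     sequence-level embedding from the speech recognizer
    video_ctx,        # (B, T, D)  context vectors from the lip-reader decoder
    audio_ctx,        # (B, T, D)  teacher context vectors, aligned to the same T steps
    video_frames,     # (B, Tv, D) frame-level video encoder outputs
    audio_frames,     # (B, Tv, D) teacher frame features resampled/aligned to Tv
    weights=(1.0, 1.0, 1.0),
):
    """Return a weighted sum of sequence-, context-, and frame-level losses."""
    w_seq, w_ctx, w_frm = weights
    loss_seq = F.mse_loss(video_seq_emb, audio_seq_emb)
    loss_ctx = F.mse_loss(video_ctx, audio_ctx)
    loss_frm = F.mse_loss(video_frames, audio_frames)
    return w_seq * loss_seq + w_ctx * loss_ctx + w_frm * loss_frm

# Toy usage with random tensors
B, T, Tv, D = 2, 10, 30, 256
loss = multi_granularity_distill_loss(
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, T, D), torch.randn(B, T, D),
    torch.randn(B, Tv, D), torch.randn(B, Tv, D),
)
print(loss.item())
```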

Three challenges underpin this endeavor: the mismatched sampling rates of audio and video, the imperfect predictions of speech recognizers, and the need for fine-grained knowledge transfer across modalities. LIBS addresses these by employing an alignment scheme that synchronizes the audio and video sequences, enabling effective cross-modal knowledge distillation. In addition, the authors introduce a filtering strategy based on the Longest Common Subsequence (LCS) that refines the speech recognizer's predictions, retaining only the reliable portions for distillation.
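
The following sketch conveys the general idea behind LCS-based filtering, under the assumption that filtering amounts to keeping only the characters of the teacher's prediction that lie on a longest common subsequence with the ground-truth transcript; it is not the paper's exact procedure.

```python
# Minimal sketch (an assumption about the general idea, not the paper's exact
# procedure): mark which characters of the speech recognizer's prediction
# belong to an LCS with the ground-truth transcript, so unreliable teacher
# outputs can be filtered out before distillation.
def lcs_keep_mask(prediction: str, reference: str) -> list[bool]:
    """Return a per-character mask over `prediction`: True where the character
    is part of a longest common subsequence with `reference`."""
    m, n = len(prediction), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if prediction[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to mark which prediction characters belong to the LCS.
    keep = [False] * m
    i, j = m, n
    while i > 0 and j > 0:
        if prediction[i - 1] == reference[j - 1]:
            keep[i - 1] = True
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return keep

# Characters inconsistent with the reference are masked out.
print(lcs_keep_mask("helo wrold", "hello world"))
```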

Experiments on two prominent datasets, CMLR and LRS2, show that LIBS achieves state-of-the-art performance, reducing the character error rate (CER) by 7.66% and 2.75% relative to the baseline models. These results underscore the benefit of combining visual and acoustic information in lip reading systems. Notably, the gains from LIBS are even more pronounced when training data is limited, making the approach attractive in settings where acquiring large datasets is impractical.
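
For reference, the character error rate reported above is conventionally computed as the Levenshtein edit distance between the predicted and reference transcripts, normalized by the reference length; a minimal implementation follows.

```python
# Minimal sketch of the standard CER computation: Levenshtein edit distance
# between hypothesis and reference, normalized by the reference length.
def cer(hypothesis: str, reference: str) -> float:
    m, n = len(hypothesis), len(reference)
    dp = list(range(n + 1))           # distances for the empty hypothesis prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dp[j] = min(dp[j] + 1,     # deletion
                        dp[j - 1] + 1, # insertion
                        prev + cost)   # substitution (or match)
            prev = cur
    return dp[n] / max(n, 1)

print(f"{cer('helo wrold', 'hello world'):.3f}")
```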

The theoretical implications extend to cross-modal learning more broadly, demonstrating a concrete pathway for integrating heterogeneous modalities to improve model accuracy. Practically, the insights from this research could translate into more robust lip reading applications in fields such as security and accessibility technologies.

Future research may explore applying this method to other multimodal settings, such as combining sign language recognition with audio cues. Expanding the knowledge distillation framework to more sophisticated models, such as Transformers in place of recurrent networks, may also yield further improvements for sequence-to-sequence architectures.

Overall, this paper contributes an innovative perspective to the lip reading domain, offering a refined method that bridges the gap between visual and auditory speech processing.
