
Sub-word Level Lip Reading With Visual Attention

Published 14 Oct 2021 in cs.CV and cs.CL (arXiv:2110.07603v2)

Abstract: The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.

Citations (79)

Summary

  • The paper introduces a novel visual transformer pooling module and sub-word tokenization to efficiently decode silent video speech.
  • The proposed method achieves a 22.6% word error rate on LRS2, outperforming models trained on larger datasets.
  • The study extends applications to visual speech detection, offering significant improvements over existing visual-only baselines.

A Detailed Summary of "Sub-word Level Lip Reading With Visual Attention"

Introduction and Motivation

The paper "Sub-word Level Lip Reading With Visual Attention" (2110.07603) addresses the challenging task of lip reading—decoding speech from silent video inputs. Unlike audio-based automatic speech recognition (ASR), lip reading requires processing high-dimensional video inputs, which involve substantial temporal and spatial complexity. The authors highlight practical applications, such as enhancing speech recognition in noisy environments and aiding communication for speech-impaired individuals. The inherent difficulty of lip reading due to homophemes and the necessity of video-specific tailoring—rather than merely adapting ASR techniques—motivate the need for a more specialized approach.

Methodology

Visual Backbone and Attention-based Pooling

The authors introduce an innovative visual backbone designed to address the intricacies of lip reading by focusing on visual encoding and tokenization strategies. Their approach includes a novel attention-based pooling mechanism to efficiently aggregate visual speech representations. This module, termed Visual Transformer Pooling (VTP), dynamically learns to focus on relevant spatial areas in the video inputs, enhancing feature aggregation over traditional methods like Global Average Pooling (GAP) (Figure 1).

Figure 1: Proposed lip reading architecture. Left: the input video frames are processed through a spatio-temporal CNN and a Visual Transformer Pooling module for feature extraction.
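
To make the VTP idea concrete, here is a minimal PyTorch-style sketch of attention-based spatial pooling: a learned query attends over the spatial positions of each frame's feature map instead of averaging them. The class name, layer choices, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Minimal sketch of attention-based spatial pooling (VTP-style idea).

    Instead of global average pooling over the H x W feature map of each
    frame, a learned query attends over the spatial positions so the model
    can weight informative regions (e.g. the lips) more heavily.
    Shapes and layer choices are illustrative, not the paper's exact code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch * time, H * W, dim) -- spatial tokens for each frame
        q = self.query.expand(feats.size(0), -1, -1)   # one query per frame
        pooled, _ = self.attn(q, feats, feats)         # attend over spatial tokens
        return pooled.squeeze(1)                       # (batch * time, dim)


if __name__ == "__main__":
    # 2 clips x 25 frames, a 7x7 spatial grid, 512-dim features (hypothetical sizes)
    frames = torch.randn(2 * 25, 7 * 7, 512)
    pooled = AttentionPooling(dim=512)(frames)
    print(pooled.shape)  # torch.Size([50, 512]) -- one vector per frame
```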

Sub-word Tokenization

For the first time in lip reading, the authors employ sub-word (word-piece) tokens instead of traditional character-level outputs. This choice leverages the semantic richness and efficiency of sub-words, reducing sequence length and encoding inherent language priors, thus decreasing the model's dependency on extensive language modeling.
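
As a rough illustration of why sub-word targets help, the toy snippet below segments a sentence with a hand-written word-piece vocabulary and greedy longest-match search. The vocabulary and segmenter are purely illustrative stand-ins for the learned sub-word vocabulary used in the paper; the point is only that the output sequence becomes much shorter than a character-level one.

```python
# Toy word-piece segmentation: the vocabulary below is hand-written for
# illustration only, not the paper's learned sub-word vocabulary.
TOY_VOCAB = {"play", "##ing", "the", "game", "##s"}

def wordpiece(word: str) -> list[str]:
    """Greedy longest-match segmentation of a word into word-pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOY_VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching piece found
        start = end
    return pieces

sentence = "playing the games"
char_targets = list(sentence.replace(" ", "_"))
subword_targets = [p for w in sentence.split() for p in wordpiece(w)]

print(len(char_targets), char_targets)        # 17 character-level targets
print(len(subword_targets), subword_targets)  # 5 sub-word targets: a much shorter sequence
```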

Experimental Results

The authors conduct extensive experiments on public datasets such as LRS2 and LRS3, achieving state-of-the-art results with their proposed architecture. Notably, their best model attains a 22.6% word error rate (WER) on LRS2, surpassing models trained on significantly larger datasets. The introduction of sub-word tokenization and the VTP module demonstrates considerable improvements in performance, highlighting the model's superior data efficiency.

Applications and Implications

Visual Speech Detection

Beyond lip reading, the visual backbone is leveraged for Visual Speech Detection (VSD), a crucial step for inferring speech activity from video-only inputs. The VSD model, built on top of the lip reading encoder, significantly outperforms existing visual-only baselines, even challenging some audio-visual methods (Figure 2).

Figure 2: Visual Speech Detection pipeline.
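
A minimal sketch of how such a frame-level VSD head could sit on top of a per-frame encoder is shown below; the class name, layer sizes, and the stand-in encoder are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class VisualSpeechDetector(nn.Module):
    """Sketch of a frame-level VSD head on top of a lip reading encoder.

    The encoder is assumed to map a video clip to one feature vector per
    frame; a small head then predicts a speaking / not-speaking probability
    for each frame. Layer sizes are illustrative.
    """

    def __init__(self, encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder = encoder                 # pretrained lip reading backbone
        self.head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1),                 # per-frame speaking logit
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(video)                          # (batch, time, dim)
        return torch.sigmoid(self.head(feats)).squeeze(-1)   # (batch, time)


if __name__ == "__main__":
    # Identity as a stand-in encoder over precomputed per-frame features.
    feats = torch.randn(2, 25, 512)
    probs = VisualSpeechDetector(nn.Identity(), dim=512)(feats)
    print(probs.shape)  # torch.Size([2, 25]) -- speaking probability per frame
```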

Future Directions and Ethical Considerations

The research positions itself as a foundational advancement in visual speech technology, offering both theoretical and practical implications. Future work could explore further integration with contextual and multimodal information to enhance robustness. The authors transparently address potential ethical concerns, such as privacy implications of lip-reading technology, emphasizing the intention to release their models to encourage research transparency and democratization.

Conclusion

"Sub-word Level Lip Reading With Visual Attention" advances the frontier of lip reading by introducing tailored architectural innovations and leveraging sub-word tokenization. This research not only presents empirically validated improvements but also extends the applicability of visual speech technology, underscoring the continued convergence of visual and language processing in machine learning.



Explain it Like I'm 14

What this paper is about (big picture)

This paper builds a smarter computer system that can “read lips,” meaning it turns silent videos of people talking into written text. The authors improve how the system looks at the mouth area, how it turns mouth movements into words, and how it figures out when someone is speaking—using only the video. Their goal is to make lip reading more accurate and less dependent on huge amounts of training data.

What questions the researchers asked

  • How can we design a lip-reading model that is tailored to video (not just copied from audio speech systems)?
  • Can we help the model “look” at the right parts of the face in each frame, instead of averaging everything?
  • Instead of predicting one character at a time, can we predict slightly bigger chunks of words (sub-words) to make reading lip movements easier and faster?
  • Can we detect when someone is speaking just from video (no sound), so we know which parts of a clip to transcribe?

How they did it (in simple terms)

Think of the system like a very focused viewer and a smart reader working together:

  • A “spotlight” for the face (attention-based pooling):
    • In each video frame, there’s a lot to look at—lips, cheeks, jaw, and more.
    • The model uses an attention mechanism (like a learnable spotlight) to automatically focus on the most important areas for understanding speech, frame by frame. This helps it “track” the mouth movements even if the person turns their head.
  • Reading in chunks (sub-words/word-pieces):
    • Instead of predicting text letter-by-letter (which can be slow and confusing when letters look similar on lips), the model predicts sub-words—small pieces of words like “play-” and “-ing.”
    • This shortens the output sequence, speeds up training and inference, and gives the model helpful hints about language (since sub-words carry meaning and patterns).
  • A video-only speech detector:
    • They add a simple head on top of the lip-reading encoder to decide frame by frame if the person is speaking.
    • This is like a video version of “voice activity detection” but without audio.
  • Training approach:
    • Stage 1: Train on short two-word clips to learn the basics.
    • Stage 2: Freeze the visual part, then train the language part on longer chunks. This is a simpler, more efficient version of older training curricula (a short code sketch follows this list).
  • Datasets and evaluation:
    • LRS2 and LRS3 are large public datasets of people speaking on TV shows and TED/TEDx talks.
    • They measure accuracy using Word Error Rate (WER): the percentage of words that are wrong (lower is better); a worked example follows this list.
    • For detecting speaking in video, they use a benchmark called AVA-ActiveSpeaker and measure mean Average Precision (mAP): higher is better.
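
As a rough illustration of the second training stage (freeze the visual part, train the rest on longer chunks), here is a minimal PyTorch-style sketch. The module names and learning rate are hypothetical placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical model with a visual frontend and a sub-word decoder;
# the module names are placeholders, not the authors'.
model = nn.ModuleDict({
    "visual_frontend": nn.Linear(512, 512),
    "decoder": nn.Linear(512, 1000),
})

# Stage 2 of the curriculum described above: freeze the visual part and
# train only the language/decoder part.
for p in model["visual_frontend"].parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```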
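
To make the WER metric concrete, the snippet below computes it as word-level edit distance divided by the number of reference words. This is the standard definition of the metric, not code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Counts the substitutions, insertions, and deletions needed to turn the
    hypothesis into the reference. Lower is better.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One wrong word out of six -> WER of about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```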

What they found and why it matters

  • Much better accuracy with less data:
    • On LRS2, their best model achieved a 22.6% WER—unusually low for lip reading and better than all previous models trained on public data.
    • On LRS3, they also reached strong results.
    • Importantly, they outperformed some industry models that were trained on about 10 times more data. That means their design is data-efficient.
  • The new “spotlight” helps a lot:
    • Replacing simple “average everything” pooling with attention-based pooling significantly reduced errors. The model learns to focus on the mouth and other useful areas.
  • Sub-word tokens make lip reading easier:
    • Switching from characters to sub-words improved accuracy and sped up the model by making the output shorter and more meaningful.
  • Strong video-only speech detection:
    • Their Visual Speech Detection model (video-only) beat all prior visual-only baselines on AVA-ActiveSpeaker and even outperformed several recent methods that used both audio and video. That’s impressive for silent video analysis.

Why this matters:

  • Better lip-reading models can help in noisy places (where microphones struggle), enable silent dictation, and assist people who cannot speak but can move their lips.
  • Strong video-only speech detection can automatically find speaking moments in silent films or videos without sound.

What this could lead to (impact and implications)

  • Practical tools:
    • Transcribing old silent movies and documentaries.
    • Assisting people with speech impairments by turning lip movements into text.
    • Improving speech recognition in loud environments by combining audio with visual cues.
  • More robust AI:
    • The attention mechanism for focusing on the right visual areas can be useful in other video tasks.
    • Sub-word prediction can be applied more broadly in language tasks to balance speed and accuracy.
  • Responsible use:
    • The authors note potential privacy concerns (e.g., surveillance). They point out that real-world conditions like low resolution, odd angles, and low frame rates make “secret” lip reading from far-away cameras very unreliable.
    • They plan to share code and models to support research and transparency.

Quick guide to key terms

  • Attention (visual attention): A way for the model to focus on the most important parts of an image, like shining a spotlight on the lips.
  • Sub-words (word-pieces): Small, meaningful chunks of words that help the model predict text more efficiently than single letters.
  • Word Error Rate (WER): How many words were wrong in the transcription; lower means better.
  • Visual Speech Detection (VSD): Finding when someone is speaking using only the video, no sound.
  • Transformer: A modern type of neural network that’s good at understanding sequences (like text or frames over time).

