Decoding visemes: improving machine lipreading (1710.01288v1)

Published 3 Oct 2017 in cs.CV and eess.AS

Abstract: Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech; and the parameters of the video recording, e.g. video resolution. We show that HD video is not needed to successfully lipread with a computer. The term "viseme" is used in machine lipreading to represent a visual cue or gesture which corresponds to a subgroup of phonemes where the phonemes are visually indistinguishable. A phoneme is the smallest sound one can utter. Because several phonemes correspond to each viseme, maps between the units show a many-to-one relationship. Many maps have been presented; we compare these and our results show Lee's is best. We propose a new method of speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering and we use our new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers need training on test subjects to achieve accuracy. Thus machine lipreading is highly speaker-dependent. Conversely, speaker independence is robust classification of non-training speakers. We investigate the dependence of phoneme-to-viseme maps between speakers and show there is not a high variability of visemes, but there is high variability in trajectory between visemes of individual speakers with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. We show that prior phoneme-to-viseme maps rarely have enough visemes and that the optimal size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and into words. Our novel approach uses visemes in the optimum range within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
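The many-to-one relationship between phonemes and visemes described in the abstract can be illustrated with a minimal Python sketch. The groupings and viseme labels below are hypothetical and chosen only for illustration (bilabials such as /p/, /b/, /m/ are commonly cited as looking alike on the lips); they are not Lee's map, nor the speaker-dependent maps proposed in the paper. The helper names `to_visemes` and `invert` are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical, illustrative grouping only: NOT Lee's map and NOT the
# paper's speaker-dependent maps. Several phonemes share one viseme label.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar",
    "ae": "V_open",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (many-to-one)."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


def invert(mapping):
    """Build the one-to-many viseme -> phonemes lookup.

    Decoding from visemes back to phonemes is ambiguous, which is why
    each viseme maps to a list of candidate phonemes here.
    """
    inverse = defaultdict(list)
    for phoneme, viseme in mapping.items():
        inverse[viseme].append(phoneme)
    return dict(inverse)


VISEME_TO_PHONEMES = invert(PHONEME_TO_VISEME)

# "pat", "bat" and "mat" collapse to the same viseme sequence.
pat = to_visemes(["p", "ae", "t"])
bat = to_visemes(["b", "ae", "t"])
mat = to_visemes(["m", "ae", "t"])
```

Under this toy map the three words are indistinguishable at the viseme level, so recovering the word requires choosing among the candidate phonemes in `VISEME_TO_PHONEMES`. That ambiguity is the reason the size of the viseme set matters for decoding.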

Authors (1)
  1. Helen L Bear (9 papers)
