GestSync: Determining who is speaking without a talking head (2310.05304v1)

Published 8 Oct 2023 in cs.CV

Abstract: In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not. In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement than there is between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and in determining who is the speaker in a crowd, without seeing their faces. The code, datasets and pre-trained models can be found at: \url{https://www.robots.ox.ac.uk/~vgg/research/gestsync}.
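The abstract's core recipe — two encoders, one per modality, trained with self-supervision so that in-sync audio/gesture pairs score higher than out-of-sync ones — can be made concrete with a short sketch. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the encoder architectures (small MLP + GRU), the embedding dimension, the keypoint-vector input format, and the InfoNCE-style contrastive loss are placeholders chosen for brevity, standing in for the paper's own dual-encoder design and its comparison of RGB-frame, keypoint-image, and keypoint-vector inputs.

```python
# Minimal sketch of a dual-encoder audio/gesture synchronisation model.
# Illustrative only: architectures, dimensions, and the loss are assumptions,
# not the paper's exact design. One branch embeds a window of body-keypoint
# vectors, the other embeds the corresponding audio features; training pulls
# in-sync pairs together and pushes other clips in the batch apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointEncoder(nn.Module):
    """Embeds a clip of 2D keypoint vectors, shape (B, T, K*2)."""
    def __init__(self, num_keypoints=17, dim=512):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(num_keypoints * 2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.temporal = nn.GRU(256, dim, batch_first=True)

    def forward(self, kp):                        # kp: (B, T, K*2)
        h, _ = self.temporal(self.frame_net(kp))  # (B, T, dim)
        return F.normalize(h.mean(dim=1), dim=-1) # clip-level embedding

class AudioEncoder(nn.Module):
    """Embeds a log-mel spectrogram window, shape (B, T_a, n_mels)."""
    def __init__(self, n_mels=80, dim=512):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.temporal = nn.GRU(256, dim, batch_first=True)

    def forward(self, mel):                       # mel: (B, T_a, n_mels)
        h, _ = self.temporal(self.frame_net(mel))
        return F.normalize(h.mean(dim=1), dim=-1)

def sync_contrastive_loss(v_emb, a_emb, temperature=0.07):
    """InfoNCE over the batch: each clip's in-sync audio is the positive,
    every other clip's audio serves as a negative."""
    logits = v_emb @ a_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(v_emb.size(0), device=v_emb.device)
    return F.cross_entropy(logits, targets)

# Toy forward/backward pass on random data.
video_enc, audio_enc = KeypointEncoder(), AudioEncoder()
kp = torch.randn(8, 25, 17 * 2)   # 8 clips, 25 frames, 17 keypoints each
mel = torch.randn(8, 100, 80)     # the matching audio windows
loss = sync_contrastive_loss(video_enc(kp), audio_enc(mel))
loss.backward()
```

At inference time, a model of this shape could slide the audio window over candidate temporal offsets and score each against the visual embedding: a confident similarity peak near zero offset suggests the gestures and speech are in sync, and scoring each visible person against the same audio gives one plausible route to the "who is speaking in a crowd" application the abstract describes, without relying on faces.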

