GestSync: Determining who is speaking without a talking head (2310.05304v1)
Abstract: In this paper we introduce a new synchronisation task, Gesture-Sync: determining whether a person's gestures are correlated with their speech or not. Compared to Lip-Sync, Gesture-Sync is far more challenging, as the relationship between voice and body movement is much looser than that between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and for determining who is speaking in a crowd without seeing their faces. The code, datasets and pre-trained models can be found at: \url{https://www.robots.ox.ac.uk/~vgg/research/gestsync}.
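To make the dual-encoder setup concrete, below is a minimal PyTorch-style sketch of the general approach the abstract describes: a gesture encoder (here operating on keypoint vectors, one of the three input representations compared in the paper) and an audio encoder projected into a shared embedding space, trained self-supervised by contrasting the in-sync audio segment against temporally shifted negatives from the same clip. All layer sizes, the keypoint format, and the contrastive formulation are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a dual-encoder Gesture-Sync model (illustrative only).
# Encoder sizes, the keypoint-vector input format, and the contrastive
# loss below are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureEncoder(nn.Module):
    """Encodes a window of 2D keypoint vectors: (B, T frames, K joints, 2)."""
    def __init__(self, num_joints=17, embed_dim=512):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Flatten(start_dim=2),              # (B, T, K*2)
            nn.Linear(num_joints * 2, 256),
            nn.ReLU(),
        )
        self.temporal = nn.GRU(256, embed_dim, batch_first=True)

    def forward(self, keypoints):                 # keypoints: (B, T, K, 2)
        x = self.frame_net(keypoints)
        _, h = self.temporal(x)                   # h: (1, B, D)
        return F.normalize(h[-1], dim=-1)         # (B, D), unit-norm embedding

class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram window: (B, 1, mel_bins, frames)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, spec):
        return F.normalize(self.net(spec), dim=-1)

def sync_loss(gesture_emb, audio_embs, temperature=0.07):
    """Contrastive synchronisation loss. Index 0 of audio_embs is the
    in-sync segment; the rest are shifted negatives from the same clip.
    gesture_emb: (B, D); audio_embs: (B, N, D)."""
    logits = torch.einsum('bd,bnd->bn', gesture_emb, audio_embs) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long,
                          device=logits.device)   # positive at index 0
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    B, T, K, N = 4, 25, 17, 8                     # batch, frames, joints, candidates
    g = GestureEncoder()(torch.randn(B, T, K, 2))
    a = AudioEncoder()(torch.randn(B * N, 1, 80, 100)).view(B, N, -1)
    print(sync_loss(g, a))                        # scalar training loss
```

Because no manual labels are needed (the shifted negatives come for free from the video itself), a model of this shape can be trained self-supervised, which is the property the abstract highlights; at test time, the same embeddings can score audio-gesture pairs to localise the speaker in a crowd.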