Leveraging Speech for Gesture Detection in Multimodal Communication (2404.14952v1)
Abstract: Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasizing the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance, and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding of co-speech gestures and improved methods for detecting them, facilitating the analysis of multimodal communication.
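The abstract describes per-modality backbone encoders, a speech window that may extend beyond the visual segment, and Transformer-based cross-modal and early fusion for frame-level gesture detection. The sketch below is a minimal, hypothetical illustration of that architecture, not the authors' implementation: all module names, layer sizes, window lengths, and the two-class labeling scheme are assumptions for the sake of a runnable example.

```python
# Minimal sketch (assumed, not the paper's code) of cross-modal fusion for
# co-speech gesture detection: separate projections stand in for per-modality
# backbones, skeletal frames attend to a longer speech buffer, and a Transformer
# encoder produces frame-level gesture/non-gesture logits.
import torch
import torch.nn as nn


class CrossModalGestureDetector(nn.Module):  # hypothetical name
    def __init__(self, skel_dim=274, speech_dim=768, d_model=256,
                 n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Per-modality projections; in practice these would be outputs of
        # dedicated backbones (e.g. a skeleton encoder and a pretrained
        # speech model such as wav2vec 2.0).
        self.skel_proj = nn.Linear(skel_dim, d_model)
        self.speech_proj = nn.Linear(speech_dim, d_model)

        # Cross-modal attention: visual frames query the speech window,
        # which may be longer than the visual segment.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # Joint (early-fusion style) encoding of the aligned sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=512,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, skel, speech):
        # skel:   (batch, T_vis, skel_dim)   skeletal features per video frame
        # speech: (batch, T_aud, speech_dim) speech features; T_aud need not equal T_vis
        skel = self.skel_proj(skel)
        speech = self.speech_proj(speech)

        # Align the speech stream to the visual time axis.
        fused, _ = self.cross_attn(query=skel, key=speech, value=speech)

        # Encode the fused sequence and label every visual frame.
        hidden = self.encoder(fused + skel)
        return self.classifier(hidden)  # (batch, T_vis, n_classes)


if __name__ == "__main__":
    model = CrossModalGestureDetector()
    skel = torch.randn(2, 30, 274)    # e.g. 30 video frames of pose keypoints
    speech = torch.randn(2, 50, 768)  # an extended speech buffer around the segment
    print(model(skel, speech).shape)  # torch.Size([2, 30, 2])
```

The different sequence lengths in the usage example reflect the abstract's point that the speech buffer can extend beyond the visual time segment; cross-attention absorbs both the sampling-rate mismatch and the temporal misalignment between the two streams.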
- Esam Ghaleb
- Ilya Burenko
- Marlou Rasenberg
- Wim Pouw
- Ivan Toni
- Peter Uhrig
- Anna Wilson
- Aslı Özyürek
- Raquel Fernández
- Judith Holler