
Co-Speech Gesture Detection through Multi-Phase Sequence Labeling (2308.10680v2)

Published 21 Aug 2023 in cs.CV

Abstract: Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
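To make the sequence-labeling idea concrete: a linear-chain CRF scores a whole label sequence by combining per-frame emission scores with label-to-label transition scores, and the best sequence is recovered with Viterbi decoding. The sketch below is not the authors' implementation; the label set (outside plus Kendon's preparation/stroke/retraction phases) follows the abstract, while all scores are invented for illustration. It shows how a transition penalty on jumping from "outside" straight to "stroke" lets context override a locally ambiguous frame.

```python
# Illustrative sketch only: Viterbi decoding for a linear-chain CRF over
# gesture-phase labels. Emission and transition scores are made up; in the
# paper's framework, emissions would come from Transformer-encoded
# skeletal-movement windows.

LABELS = ["O", "PREP", "STROKE", "RETR"]  # outside + gesture phases

def viterbi(emissions, transitions, labels):
    """emissions: T x K per-frame label scores; transitions: K x K scores
    for moving from label j to label k. Returns the best label sequence."""
    K = len(labels)
    score = list(emissions[0])  # best path score ending in each label
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [0.0] * K, [0] * K
        for k in range(K):
            best_j = max(range(K), key=lambda j: score[j] + transitions[j][k])
            ptrs[k] = best_j
            new_score[k] = score[best_j] + transitions[best_j][k] + emissions[t][k]
        score = new_score
        backptrs.append(ptrs)
    # Backtrack from the best final label.
    k = max(range(K), key=lambda j: score[j])
    path = [k]
    for ptrs in reversed(backptrs):
        k = ptrs[k]
        path.append(k)
    return [labels[i] for i in reversed(path)]

# Hand-picked transition scores: a stroke should follow a preparation,
# not appear out of nowhere.
TRANSITIONS = [[0.0] * 4 for _ in range(4)]
TRANSITIONS[0][2] = -4.0  # O -> STROKE penalized
TRANSITIONS[0][1] = 1.0   # O -> PREP favoured
TRANSITIONS[1][2] = 1.0   # PREP -> STROKE favoured

EMISSIONS = [
    [3.0, 0.0, 0.0, 0.0],  # frame 0: clearly outside any gesture
    [0.0, 1.0, 2.0, 0.0],  # frame 1: locally looks most like a stroke
    [0.0, 0.0, 3.0, 0.0],  # frame 2: stroke
    [0.0, 0.0, 0.0, 3.0],  # frame 3: retraction
]

print(viterbi(EMISSIONS, TRANSITIONS, LABELS))
# -> ['O', 'PREP', 'STROKE', 'RETR']
```

A per-frame argmax would label frame 1 as STROKE; the transition scores reinterpret it as PREP, illustrating why sequence labeling captures gesture-phase structure better than independent binary decisions.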

Authors (9)
  1. Esam Ghaleb (6 papers)
  2. Ilya Burenko (4 papers)
  3. Marlou Rasenberg (4 papers)
  4. Wim Pouw (5 papers)
  5. Peter Uhrig (5 papers)
  6. Ivan Toni (3 papers)
  7. Aslı Özyürek (5 papers)
  8. Raquel Fernández (52 papers)
  9. Judith Holler (4 papers)
Citations (3)