REWIND Dataset: Privacy-preserving Speaking Status Segmentation from Multimodal Body Movement Signals in the Wild (2403.01229v1)

Published 2 Mar 2024 in cs.CV, cs.AI, cs.LG, and eess.SP

Abstract: Recognizing speaking in humans is a central task towards understanding social interactions. Ideally, speaking would be detected from individual voice recordings, as done previously for meeting scenarios. However, individual voice recordings are hard to obtain in the wild, especially in crowded mingling scenarios due to cost, logistics, and privacy concerns. As an alternative, machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way. These models themselves should ideally be trained using labels obtained from the speech signal. However, existing mingling datasets do not contain high quality audio recordings. Instead, speaking status annotations have often been inferred by human annotators from video, without validation of this approach against audio-based ground truth. In this paper we revisit no-audio speaking status estimation by presenting the first publicly available multimodal dataset with high-quality individual speech recordings of 33 subjects in a professional networking event. We present three baselines for no-audio speaking status segmentation: a) from video, b) from body acceleration (chest-worn accelerometer), c) from body pose tracks. In all cases we predict a 20Hz binary speaking status signal extracted from the audio, a time resolution not available in previous datasets. In addition to providing the signals and ground truth necessary to evaluate a wide range of speaking status detection methods, the availability of audio in REWIND makes it suitable for cross-modality studies not feasible with previous mingling datasets. Finally, our flexible data consent setup creates new challenges for multimodal systems under missing modalities.
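The abstract describes the core task: derive a 20 Hz binary speaking-status target from each subject's individual audio, then predict it from non-audio signals such as chest-worn acceleration. The sketch below is a minimal illustration of that setup, not the authors' pipeline; the function names, sampling rates, window size, and majority-vote binarization are assumptions made for clarity.

```python
# Minimal sketch (not the REWIND code) of the no-audio speaking-status task:
# a 20 Hz binary speaking-status signal derived from per-subject audio VAD,
# paired with windows of chest-worn tri-axial acceleration for prediction.
# Sampling rates, window length, and binarization rule are illustrative assumptions.
import numpy as np

def binarize_vad_to_20hz(vad_prob, vad_rate_hz, threshold=0.5, target_hz=20):
    """Downsample a per-frame voice-activity probability track to a 20 Hz
    binary speaking-status signal by averaging within each 50 ms bin."""
    bin_size = int(round(vad_rate_hz / target_hz))
    n_bins = len(vad_prob) // bin_size
    binned = vad_prob[: n_bins * bin_size].reshape(n_bins, bin_size)
    return (binned.mean(axis=1) > threshold).astype(np.int8)

def window_accel(accel, accel_rate_hz, labels, target_hz=20, win_s=1.0):
    """Pair each 20 Hz label with a centered window of tri-axial acceleration.
    Returns (n_windows, win_samples, 3) features and (n_windows,) labels."""
    win = int(accel_rate_hz * win_s)
    hop = accel_rate_hz / target_hz
    X, y = [], []
    for i, lab in enumerate(labels):
        center = int(round(i * hop))
        start = center - win // 2
        if start < 0 or start + win > len(accel):
            continue  # drop labels whose window falls outside the recording
        X.append(accel[start:start + win])
        y.append(lab)
    return np.stack(X), np.array(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vad_prob = rng.random(60 * 100)            # 60 s of VAD posteriors at 100 Hz (synthetic)
    accel = rng.standard_normal((60 * 50, 3))  # 60 s of accelerometer data at 50 Hz (synthetic)
    y = binarize_vad_to_20hz(vad_prob, vad_rate_hz=100)
    X, y_aligned = window_accel(accel, accel_rate_hz=50, labels=y)
    print(X.shape, y_aligned.shape)  # aligned acceleration windows and 20 Hz labels
```

Any binary classifier over the acceleration windows (or analogous windows of video or pose features) could then be evaluated against the audio-derived labels; the paper's actual baselines use video, body acceleration, and pose-track models.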

Authors (6)
  1. Jose Vargas Quiros (3 papers)
  2. Chirag Raman (19 papers)
  3. Stephanie Tan (5 papers)
  4. Ekin Gedik (3 papers)
  5. Laura Cabrera-Quiros (4 papers)
  6. Hayley Hung (18 papers)
Citations (2)
