
Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos (2402.11145v1)

Published 17 Feb 2024 in cs.HC, cs.CV, and cs.LG

Abstract: Multimodal scene search of conversations is essential for unlocking valuable insights into social dynamics and enhancing our communication. While experts in conversational analysis have their own knowledge and skills to find key scenes, the lack of comprehensive, user-friendly tools that streamline the processing of diverse multimodal queries impedes efficiency and objectivity. To address this, we developed Providence, a visual-programming-based tool built on design considerations derived from a formative study with experts. It enables experts to combine various machine learning algorithms to capture human behavioral cues without writing code. Our study showed that participants preferred its usability and found its output satisfactory, with less cognitive load imposed in accomplishing scene search tasks of conversations, verifying the importance of its customizability and transparency. Furthermore, through an in-the-wild trial, we confirmed that the objectivity and reusability of the tool transform experts' workflow, suggesting the advantage of expert-AI teaming in a highly human-contextual domain.
