Audio-visual child-adult speaker classification in dyadic interactions (2310.01867v2)
Abstract: Interactions involving children span a wide range of important domains, from learning to clinical diagnostic and therapeutic contexts. Automated analysis of such interactions is motivated by the need for accurate insights delivered at scale and with robustness across diverse and wide-ranging conditions. Identifying the speech segments belonging to the child is a critical step in such modeling. Conventional child-adult speaker classification typically relies on audio-only modeling, overlooking visual signals that convey speech-articulation information, such as lip motion. Building on an audio-only child-adult speaker classification pipeline, we propose incorporating visual cues through active speaker detection and visual processing models. Our framework comprises video pre-processing, utterance-level child-adult speaker detection, and late fusion of modality-specific predictions. Extensive experiments demonstrate that the visually aided pipeline improves both the accuracy and the robustness of classification, with relative F1-macro improvements of 2.38% when one face is visible and 3.97% when two faces are visible.
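The late-fusion step described above can be sketched as a weighted combination of per-utterance class probabilities from the audio and visual branches. This is only an illustrative sketch: the function name `late_fuse`, the fixed weight `w_audio`, and the toy probabilities are assumptions, not the paper's actual fusion scheme or values.

```python
import numpy as np

def late_fuse(audio_probs, visual_probs, w_audio=0.6):
    """Weighted late fusion of modality-specific predictions.

    audio_probs, visual_probs: (n_utterances, 2) arrays of class
    probabilities with columns [child, adult]. w_audio is a
    hypothetical mixing weight; in practice it could be tuned on
    a validation set.
    """
    fused = w_audio * np.asarray(audio_probs) \
        + (1.0 - w_audio) * np.asarray(visual_probs)
    # Argmax over classes: 0 = child, 1 = adult
    return fused.argmax(axis=1)

# Toy example: both modalities agree on utterance 0 (child);
# on utterance 1 the fused score favors the adult class.
audio = [[0.8, 0.2], [0.2, 0.8]]
visual = [[0.7, 0.3], [0.45, 0.55]]
labels = late_fuse(audio, visual)  # array([0, 1])
```

A weighted average keeps the two branches decoupled, so either modality can be dropped (e.g., when no face is visible) by setting its weight to zero without retraining.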