- The paper introduces CTCNet, a novel model inspired by cortico-thalamo-cortical circuits that improves audio-visual speech separation through hierarchical and iterative multimodal fusion.
- It employs distinct auditory and visual subnetworks with a thalamic integration module, yielding higher SDRi and SI-SNRi scores on multiple benchmark datasets.
- CTCNet's efficient design, with fewer parameters than state-of-the-art models, highlights the potential of brain-inspired architectures for advanced audio processing applications.
Overview of "An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits"
The paper "An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits" explores the integration of multimodal audio-visual inputs to improve speech separation systems, a task that mimics the human ability to discern distinct audio sources in complex acoustic scenes or the "cocktail party effect". The authors present a novel model, CTCNet, inspired by the cortico-thalamo-cortical circuits of the brain, highlighting its capability to outperform existing audio-visual speech separation (AVSS) models, especially within lower parameter domains.
Key Contributions
The primary contribution of the paper is the introduction of the cortico-thalamo-cortical neural network (CTCNet) designed for audio-visual speech separation. This novel approach is inspired by the anatomical and functional structures of the human brain, particularly focusing on:
- Hierarchical Information Processing:
- The CTCNet model utilizes separate auditory and visual subnetworks constructed to replicate the hierarchical processing observed within the cortical areas of the mammalian brain. This setup enables the system to learn distinct representations of auditory and visual inputs in a bottom-up fashion.
- Thalamic Subnetwork Integration:
- A key innovation of CTCNet is a thalamic subnetwork that integrates auditory and visual information. This component mimics the thalamus's role in multisensory processing, modeling the brain's multimodal fusion through top-down projections from cortical areas to the thalamus.
- Iterative Processing:
- The model repeatedly fuses the two modalities and sends the result back to the auditory and visual subnetworks, refining both the integration and the separation over several fusion-and-feedback cycles (a minimal sketch of this loop follows the list).
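To make the three mechanisms above concrete, the following PyTorch sketch shows one way such a fusion-and-feedback loop could be wired up. It is a simplified illustration, not the authors' implementation: the module names (`UnimodalSubnet`, `ThalamicFusion`, `CTCStyleSeparator`), the layer sizes, the concatenation-based fusion, and the additive feedback are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalSubnet(nn.Module):
    """Hierarchical (bottom-up) subnetwork for a single modality."""

    def __init__(self, channels: int, depth: int = 3):
        super().__init__()
        # Each level halves the temporal resolution, loosely mimicking the
        # increasingly abstract representations of a cortical hierarchy.
        self.levels = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=5, stride=2, padding=2)
            for _ in range(depth)
        ])
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        h = x
        for layer in self.levels:          # bottom-up pass
            h = torch.relu(layer(h))
        top = h                            # coarsest, most abstract features
        # Bring the top-level summary back to the input resolution and
        # combine it with the input (a residual-style top-down pass).
        out = x + torch.relu(self.proj(F.interpolate(top, size=x.shape[-1])))
        return out, top


class ThalamicFusion(nn.Module):
    """Fuses top-level auditory and visual features (the "thalamus")."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, a_top, v_top):
        # Align the (slower) visual frame rate to the auditory one, then fuse.
        v_top = F.interpolate(v_top, size=a_top.shape[-1])
        return torch.relu(self.fuse(torch.cat([a_top, v_top], dim=1)))


class CTCStyleSeparator(nn.Module):
    """Iterative cortico-thalamo-cortical style fusion loop."""

    def __init__(self, channels: int = 64, cycles: int = 3):
        super().__init__()
        self.audio_net = UnimodalSubnet(channels)
        self.video_net = UnimodalSubnet(channels)
        self.thalamus = ThalamicFusion(channels)
        self.cycles = cycles

    def forward(self, audio_feats, video_feats):
        a, v = audio_feats, video_feats
        for _ in range(self.cycles):
            a, a_top = self.audio_net(a)          # "auditory cortex"
            v, v_top = self.video_net(v)          # "visual cortex"
            fused = self.thalamus(a_top, v_top)   # multimodal "thalamus"
            # Feedback: the fused signal modulates both streams for the
            # next cycle (thalamo-cortical projection).
            a = a + F.interpolate(fused, size=a.shape[-1])
            v = v + F.interpolate(fused, size=v.shape[-1])
        # The refined auditory features would then be decoded into a
        # separation mask over the mixture.
        return a
```

In this sketch the number of fusion cycles is the main knob: more cycles allow the feedback to refine the separation further, at the cost of extra computation.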
Experimental Results
The researchers evaluated the CTCNet on three benchmark datasets: LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix, which vary in speech quality and noise levels. The model demonstrated a robust separation capability across these datasets. Notable results include:
- CTCNet achieved substantial improvements over existing AVSS methods, with SDRi (signal-to-distortion ratio improvement) and SI-SNRi (scale-invariant signal-to-noise ratio improvement) scores clearly exceeding those of competing models (both metrics are defined in the sketch after this list).
- In terms of parameter efficiency, CTCNet outperformed state-of-the-art methods while using fewer parameters, underscoring that the brain-inspired design keeps the model compact without sacrificing separation quality.
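For readers unfamiliar with the scoring, the snippet below gives the standard definitions of SI-SNR and SI-SNRi used throughout the AVSS literature; SDRi is computed analogously from the (non-scale-invariant) signal-to-distortion ratio. This is the conventional formulation of the metrics, not code taken from the paper.

```python
import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB) between two waveforms."""
    estimate = estimate - estimate.mean()   # zero-mean so DC offsets are ignored
    target = target - target.mean()
    # Project the estimate onto the target: the part of the estimate that is
    # actually the target signal, up to an arbitrary scale.
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))


def si_snr_improvement(estimate: np.ndarray, mixture: np.ndarray,
                       target: np.ndarray) -> float:
    """SI-SNRi: how much the separated estimate improves over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```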
Implications and Future Directions
The paper suggests that emulating neural architectures like cortico-thalamo-cortical circuits could be pivotal in enhancing the capability of deep neural networks for complex tasks like speech separation. Given its demonstrated efficiency and effectiveness, CTCNet could play a crucial role in applications requiring audio-visual processing, such as hearing aids and audio enhancement systems in complex environments.
The authors point to potential extensions of this work, such as modeling higher-order brain regions and broader brain-network interactions beyond the thalamic and cortical structures. Further work on extended architectures, richer training data, and integration with other sensory modalities could open new avenues for AI systems that parallel human sensory integration.
In conclusion, this research demonstrates the promise of neuro-inspired models for AVSS and may steer future AI development toward more biologically plausible and functionally capable systems. By mirroring the rich, integrative networks found in the brain, models like CTCNet could fundamentally change how complex sensory processing tasks are approached in AI.