LSZone: Lightweight In-Car Speech Separation

Updated 14 October 2025

LSZone is a lightweight spatial speech separation model that fuses Mel spectrograms with IPD cues to distinguish speech from different in-car zones.
It employs a SpaIEC module for extracting reduced-dimensional spatial features and a Conv-GRU CNP module for efficient cross-domain temporal modeling.
LSZone achieves real-time processing with 0.56G MACs and an RTF of 0.37 while demonstrating improved CER and FIR in noisy, multi-speaker scenarios.

LSZone is a lightweight spatial information modeling architecture specifically designed for real-time in-car multi-zone speech separation. This model aims to separate and extract speech signals originating from different spatial zones inside a vehicle, under constraints of computational efficiency and robustness to noise and speaker interference. LSZone leverages efficient spatial feature fusion and innovative cross-domain processing modules to address the elevated demands for natural human-vehicle interaction in complex acoustic environments.

1. Motivation and Context

In-car speech separation must contend with overlapping sources, reverberant vehicle interiors, engine and road noise, and interaction between multiple speakers distributed across defined physical zones. Prior approaches, notably SpatialNet, encode spatial information from the full complex-valued Short-Time Fourier Transform (STFT) together with spatial cue extraction. Their high computational cost is a barrier to deployment on embedded, resource-constrained automotive platforms. LSZone is introduced to overcome this barrier, targeting a low count of multiply-accumulate operations (MACs), a low real-time factor (RTF), and a compact model design while maintaining separation performance in adverse noise and multi-speaker conditions.

2. Architecture: SpaIEC and Conv-GRU CNP Modules

LSZone’s performance derives from two central modules:

A. SpaIEC Module (Spatial Information Extraction-Compression):

Rather than using the full STFT, which imposes a large feature dimensionality, SpaIEC fuses Mel spectrograms (lower-dimensional and robust for speech) with Interaural Phase Difference (IPD) cues. IPD captures the spatial phase differences between microphones, essential for disambiguating the origin zones of simultaneous speakers.
For each channel $z$, pairwise IPDs are computed as $IPD_{z,k} = X_{phase}[z] - X_{phase}[k]$ for $k \neq z$. All IPDs for a channel are stacked, $IPD_z = \bigoplus_{k \neq z} IPD_{z,k}$, and the results are concatenated across all zones.
This IPD tensor is then compressed by a Conv1D "Squeezer." A Gate Fusion module (Conv1D + Sigmoid) merges the compressed IPD features with the Mel spectrogram features, producing a reduced-dimensionality spatial-spectral tensor $X_s \in \mathbb{R}^{N_{mel} \times T \times Z}$.
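The pairwise IPD construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the multichannel STFT is held as nested lists indexed `stft[z][f][t]`, and the names `pairwise_ipd` and `wrap_phase` are hypothetical. The phase wrapping helper is an optional extra that the formula above does not mention. The Squeezer and Gate Fusion stages are not shown.

```python
import cmath
import math


def pairwise_ipd(stft):
    """Return ipd[z][j][f][t] = phase[z][f][t] - phase[k][f][t].

    stft[z][f][t] is a complex STFT bin for channel z (assumed layout).
    j enumerates the other channels k != z in ascending order, so
    ipd[z] is the stacked IPD_z and ipd collects it for every channel.
    """
    num_ch = len(stft)
    # Phase angle of every time-frequency bin, per channel.
    phase = [[[cmath.phase(x) for x in row] for row in ch] for ch in stft]
    ipd = []
    for z in range(num_ch):
        stacked = []
        for k in range(num_ch):
            if k == z:
                continue  # a channel is never paired with itself
            diff = [
                [pz - pk for pz, pk in zip(row_z, row_k)]
                for row_z, row_k in zip(phase[z], phase[k])
            ]
            stacked.append(diff)
        ipd.append(stacked)
    return ipd


def wrap_phase(x):
    """Wrap an angle into [-pi, pi). Optional step, an assumption here."""
    return (x + math.pi) % (2 * math.pi) - math.pi


# Toy input: 3 channels, 1 frequency bin, 1 frame, phases 0, pi/2, pi.
toy_stft = [[[1 + 0j]], [[1j]], [[-1 + 0j]]]
toy_ipd = pairwise_ipd(toy_stft)
```

With $Z$ channels, each channel contributes $Z - 1$ phase-difference maps, so the stacked tensor grows quadratically in $Z$ before compression, which is consistent with the text's motivation for compressing it.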

B. Conv-GRU CNP Module (Crossband-Narrowband Processing):

To further limit computation, LSZone employs alternating blocks: "Conv Crossband," which applies parallel Frequency Conv1D and Group Conv1D across frequency bands to model spatial-frequency interactions, and "GRU Narrowband," which applies LayerNorm, GRU, and Linear layers for temporal modeling in a dimensionally economical fashion.
This sequence achieves interactive modeling of spatial, spectral, and temporal information with minimal parameter count and computational overhead, replacing heavier LSTM/attention-based modules common to previous systems.
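One reason a GRU is lighter than an LSTM of the same width can be shown with a textbook parameter count. The sketch below assumes one bias vector per gate (some frameworks store two) and uses a hypothetical hidden size of 64; the layer sizes actually used in LSZone are not given in this summary.

```python
def recurrent_params(input_size, hidden_size, num_gates):
    """Parameter count of one recurrent layer with the given gate count.

    Each gate owns an input weight matrix, a recurrent weight matrix,
    and a single bias vector (textbook formulation, an assumption).
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_gates * per_gate


GRU_GATES = 3   # reset, update, candidate state
LSTM_GATES = 4  # input, forget, output, candidate cell

hidden = 64  # hypothetical width, not taken from the paper
gru_params = recurrent_params(hidden, hidden, GRU_GATES)
lstm_params = recurrent_params(hidden, hidden, LSTM_GATES)
print(gru_params, lstm_params, gru_params / lstm_params)
```

For equal input and hidden sizes the ratio is always 3/4, independent of the width chosen, so this accounts for only part of the savings reported for the full model.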

3. Computational Complexity and Efficiency Metrics

Quantitative evaluation establishes the efficiency of LSZone:

Computational Complexity (MACs): LSZone requires 0.56G MACs, roughly 12 times fewer than SpatialNet’s 6.98G MACs.
Real-Time Factor (RTF): LSZone’s RTF is 0.37 on an Intel Xeon Platinum 8180 CPU, which supports real-time operation, whereas SpatialNet’s RTF of 2.91 exceeds the real-time threshold of 1.
Character Error Rate (CER) and False Intrusion Rate (FIR): The model attains a CER of 17.20% and FIR of 10.26%, outperforming comparative baselines in mixed-speaker, noisy environments.
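For readers unfamiliar with CER, the sketch below uses the standard definition: the character-level edit distance between reference and hypothesis transcripts, divided by the reference length. The function names are hypothetical, and the scoring pipeline used in the paper (text normalization, ASR back end) is not described here, so this only illustrates the metric itself. FIR is not sketched because its exact definition is not given in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of six.
example_cer = cer("今天天气很好", "今天天气真好")
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.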

Model	MACs (G)	RTF	CER (%)	FIR (%)
LSZone	0.56	0.37	17.20	10.26
SpatialNet	6.98	2.91	higher	higher

These metrics demonstrate LSZone’s real-time suitability and empirical effectiveness for in-car speech separation. The drastic MAC and RTF reductions are attributed to the SpaIEC and Conv-GRU CNP module efficiencies.
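The efficiency figures can be put in proportion with a short calculation. The sketch assumes the usual definition of RTF as processing time divided by audio duration, so values below 1 indicate faster-than-real-time processing; the helper names are hypothetical and the numbers are those quoted in the table.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds


def is_real_time(rtf):
    """A system keeps up with the incoming audio only when RTF < 1."""
    return rtf < 1.0


# Figures quoted in the table for the two models.
lszone = {"macs_g": 0.56, "rtf": 0.37}
spatialnet = {"macs_g": 6.98, "rtf": 2.91}

mac_reduction = spatialnet["macs_g"] / lszone["macs_g"]
rtf_speedup = spatialnet["rtf"] / lszone["rtf"]

print(round(mac_reduction, 1))  # 12.5
print(round(rtf_speedup, 1))    # 7.9
print(is_real_time(lszone["rtf"]), is_real_time(spatialnet["rtf"]))  # True False
```

An RTF of 0.37 means that 10 seconds of audio takes about 3.7 seconds to process on the stated CPU.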

4. Experimental Protocol and Results

LSZone is benchmarked in a scenario involving a six-zone distributed microphone array within a simulated vehicle environment. Speech mixtures are generated using clean speech from AISHELL-1 and noise from DNS Challenge 5, convolved with simulated Room Impulse Responses (RIRs). The dataset consists of 120k training clips and validation/test sets designed for single-, two-, and three-speaker conditions.
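The mixture-generation step can be illustrated as follows. This is a simplified single-channel sketch, not the paper's data pipeline: it convolves clean speech with an RIR and then adds noise. Scaling the noise to a target signal-to-noise ratio is an assumption made here for illustration, since this summary does not state how noise levels were set, and all function names are hypothetical.

```python
import math


def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to the requested SNR in dB.

    The SNR-based scaling is an assumption, not taken from the paper.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy example: a 3-sample "utterance" and a 2-tap impulse response.
reverberant = convolve([1.0, 2.0, 3.0], [1.0, 0.5])
noisy = mix_at_snr(reverberant, [0.1, -0.1, 0.1, -0.1], snr_db=10.0)
```

In a multi-zone setup the same speech would be convolved with one RIR per microphone, giving one reverberant signal per channel before noise is added.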

On these tests, LSZone surpasses baselines (Zoneformer, DualSep, and SpatialNet) especially under multi-speaker and noisy conditions, as evidenced by lower CER and FIR.
Visualizations in the paper present the flow of inputs through the SpaIEC and Conv-GRU CNP modules, with tables confirming the trade-offs between parameter count, performance, and computational efficiency.
Module ablation experiments show that Mel spectrogram–IPD fusion outperforms Mel-only or STFT-only feature constructions, confirming the design decisions.

5. Significance and Limitations

LSZone demonstrates that Mel spectrogram/IPD fusion preserves spatial cues at reduced feature dimensionality, and that the Conv-GRU CNP module captures spatio-temporal patterns without the cost of the heavier LSTM or attention-based layers used in earlier systems. LSZone’s performance indicates that spatial information can be encoded in a form lightweight enough for real-time deployment in an automotive environment.

Limitations include the fixed structure of feature fusion and processing, which may require adaptation for environments with dramatically different microphone arrays or acoustics. The evaluation is performed on a simulated dataset; assessment in real in-car environments would further establish practical robustness. Extension to more complex ambient conditions or integration with end-to-end ASR is suggested as a future direction.

6. Future Directions

Advancements may focus on further reducing computational bottlenecks, designing alternative fusion mechanisms for spatial cues, and extending the model to more zones and dynamic speaker configurations. Integration with real-time ASR systems and adaptation for diverse vehicle types is also proposed. Module-level optimizations (such as enhancing the SpaIEC fusion or employing novel temporal modeling strategies) could further improve LSZone’s efficiency and accuracy in larger fleets or commercial deployments.

7. Summary

LSZone is a compact architecture for in-car spatial speech separation that combines Mel spectrogram/IPD-based feature compression (SpaIEC) with a computationally efficient spatio-temporal modeling module (Conv-GRU CNP). It achieves real-time processing with 0.56G MACs and an RTF of 0.37, and demonstrates robust performance in challenging multi-speaker and noisy scenarios. The model’s architectural choices, validated through comparative metrics, establish a pathway for scalable, resource-efficient multi-zone speech separation in next-generation human-vehicle interfaces (Chen et al., 12 Oct 2025).

References

Chen et al. (2025). LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation.
