DSSCNet: Dual Domain Deep Architectures
- DSSCNet names two distinct deep architectures: one classifies dysarthric speech severity using CNNs with squeeze-excitation and residual modules, and the other performs deep subspace clustering via a self-expressive autoencoder.
- The speech classification component employs rigorous preprocessing, cross-corpus fine-tuning, and ablation analyses to significantly outperform traditional methods.
- The subspace clustering model integrates a self-expressive layer within an autoencoder framework to learn robust latent representations, reducing clustering errors on benchmark datasets.
DSSCNet refers to two distinct deep learning architectures, each targeting a fundamentally different domain. The first, Dysarthric Speech Severity Classification Network (DSSCNet), is designed for fine-grained classification of dysarthric speech severity in a speaker-independent setting by leveraging convolutional, squeeze-excitation, and residual networks. The second, Deep Subspace Clustering Network (DSSCNet), is an unsupervised architecture for nonlinear subspace clustering, introducing a self-expressive layer between encoder and decoder in a deep autoencoder. Each system is characterized by its own architectural principles, objectives, and evaluation standards.
1. DSSCNet for Dysarthric Speech Severity Classification
DSSCNet implements a deep convolutional framework for discriminative modeling of dysarthric speech severity. The pipeline begins with preprocessing of the raw audio: waveforms sampled at 16 kHz are trimmed of silence, segmented with 16 ms Hanning windows and a 4 ms hop size, analyzed via STFT, filtered with 128 mel bands, and then log-compressed. The resulting spectrogram is resized to 128 × 128 and duplicated into three channels to conform with CNN input conventions.
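The framing parameters above translate into sample counts with a little arithmetic. The following is a minimal sketch of that conversion only, not the authors' preprocessing code; the helper names `ms_to_samples`, `num_frames`, and `hann_window` are hypothetical, and the frame count assumes no padding at the signal edges.

```python
import math

SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the text
WINDOW_MS = 16           # Hanning window length stated in the text
HOP_MS = 4               # hop size stated in the text


def ms_to_samples(duration_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate_hz * duration_ms // 1000


def num_frames(num_samples: int, window: int, hop: int) -> int:
    """Number of full frames when the signal is NOT padded (an assumption)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def hann_window(length: int) -> list:
    """Symmetric Hann window, w[n] = 0.5 - 0.5 * cos(2*pi*n / (length - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


window = ms_to_samples(WINDOW_MS)   # 256 samples per frame
hop = ms_to_samples(HOP_MS)         # 64 samples between frame starts
freq_bins = window // 2 + 1         # 129 one-sided STFT bins for a 256-point FFT
frames_per_second = num_frames(SAMPLE_RATE_HZ, window, hop)  # 247 frames
taper = hann_window(window)
```

With a 75% overlap between consecutive frames, one second of audio yields 247 unpadded frames, which the mel filterbank then reduces from 129 frequency bins to 128 bands.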
The architecture is structured in three principal stages:
- Convolutional Backbone (CNet): Composed of three convolutional layers (64, 128, 256 filters; 3 × 3 kernels; batch normalization; ReLU activation; 2 × 2 max pooling), outputting a 256 × 16 × 16 tensor.
- Squeeze-Excitation (SE) Block: Channel-wise weights are adaptively computed using global average pooling, followed by a bottlenecked two-layer MLP (bottleneck ratio r = 16, with ReLU and sigmoid activations). Each channel is multiplied by its corresponding learned scalar, focusing attention on salient dysarthria cues.
- Residual Network Module: The 256 × 16 × 16 output is processed through three residual fully-connected blocks (dimensions: 256 → 256, 256 → 512, 512 → 1024), each with an identity skip connection to ameliorate vanishing gradients and facilitate deeper semantic modeling of spectrotemporal patterns.
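The three stages above can be traced at the level of tensor shapes with a short NumPy sketch. This is an illustration under stated assumptions rather than the published implementation: the weights are random, the convolutions are assumed to preserve spatial size so that only pooling shrinks it, the feature map is assumed to be globally average-pooled before the fully-connected blocks, and a linear projection is assumed on the skip path wherever input and output widths differ. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def pooled_size(size: int, num_pools: int) -> int:
    """Spatial size after `num_pools` stride-2, 2 x 2 max-pooling stages."""
    for _ in range(num_pools):
        size //= 2
    return size


def se_block(x, w1, b1, w2, b2):
    """Squeeze-excitation on a (C, H, W) feature map; returns scaled map and gates."""
    squeezed = x.mean(axis=(1, 2))                      # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)        # bottleneck + ReLU -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid -> (C,), each in (0, 1)
    return x * gates[:, None, None], gates              # excite: per-channel rescaling


def residual_fc_block(v, w, b, w_skip=None):
    """ReLU(W v + b) plus a skip path (identity, or a projection if widths differ)."""
    skip = v if w_skip is None else w_skip @ v
    return np.maximum(w @ v + b, 0.0) + skip


channels, ratio = 256, 16
side = pooled_size(128, 3)                              # 128 -> 64 -> 32 -> 16
x = rng.standard_normal((channels, side, side))         # stand-in for the CNet output

w1 = rng.standard_normal((channels // ratio, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // ratio)) * 0.1
scaled, gates = se_block(x, w1, np.zeros(channels // ratio), w2, np.zeros(channels))

v = scaled.mean(axis=(1, 2))                            # assumed pooling to a 256-vector
widths = [256, 256, 512, 1024]
for d_in, d_out in zip(widths[:-1], widths[1:]):
    w = rng.standard_normal((d_out, d_in)) * 0.01
    w_skip = None if d_in == d_out else rng.standard_normal((d_out, d_in)) * 0.01
    v = residual_fc_block(v, w, np.zeros(d_out), w_skip)
```

The SE block leaves the tensor shape unchanged and only rescales channels, while the residual stack widens the representation from 256 to 1024 features.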
Training employs weighted cross-entropy with inverse class frequency rebalancing over four severity classes, Adam optimizer (learning rate 10⁻³), batch size 16, and early stopping. Weight decay suffices for regularization; no dropout is utilized.
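Inverse-class-frequency weighting can be sketched in a few lines. The example below assumes the common normalization w_c = N / (K · n_c), which may differ from the exact scheme used by the authors, and the class counts are hypothetical values chosen only for illustration.

```python
import math


def inverse_frequency_weights(counts):
    """Per-class weights w_c = N / (K * n_c); rarer classes get larger weights."""
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * n) for n in counts]


def weighted_cross_entropy(logits, label, weights):
    """Weighted negative log-likelihood of `label` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[label] * (log_sum_exp - logits[label])


# Hypothetical utterance counts for the four severity classes.
class_counts = [400, 300, 200, 100]
class_weights = inverse_frequency_weights(class_counts)

# A sample from the rarest class with uninformative (uniform) logits.
loss_rare = weighted_cross_entropy([0.0, 0.0, 0.0, 0.0], 3, class_weights)
loss_common = weighted_cross_entropy([0.0, 0.0, 0.0, 0.0], 0, class_weights)
```

Under these counts an error on the rarest class costs four times as much as the same error on the most frequent class, which is the rebalancing effect the weighting is meant to provide.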
Cross-corpus fine-tuning is a two-step process: initial training on a source corpus (e.g., UA-Speech) and subsequent fine-tuning with reduced learning rate on a target corpus (e.g., TORGO), unfreezing only the last residual blocks and output layer. This protocol facilitates transfer of learned filters across diverse clinical data distributions, yielding marked accuracy improvements.
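The freeze-and-unfreeze step of this protocol can be expressed independently of any framework. The sketch below is illustrative only: the parameter names, the choice of which residual blocks count as "last", and the tenfold learning-rate reduction are all assumptions, since the text does not specify them.

```python
def select_trainable(param_names, trainable_prefixes):
    """Map each parameter name to True if it should be updated during fine-tuning."""
    prefixes = tuple(trainable_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# Hypothetical parameter names for the three stages plus an output layer.
PARAM_NAMES = [
    "conv1.weight", "conv2.weight", "conv3.weight",
    "se.fc1.weight", "se.fc2.weight",
    "res1.weight", "res2.weight", "res3.weight",
    "head.weight",
]

# Assumption: "last residual blocks" is taken to mean blocks 2 and 3.
FINE_TUNE_PREFIXES = ("res2.", "res3.", "head.")

SOURCE_LR = 1e-3                 # learning rate stated for initial training
FINE_TUNE_LR = SOURCE_LR / 10    # assumed reduction factor, not given in the text

trainable = select_trainable(PARAM_NAMES, FINE_TUNE_PREFIXES)
frozen = [name for name, is_trainable in trainable.items() if not is_trainable]
```

In a real training loop, the `trainable` map would decide which parameters receive gradient updates on the target corpus, leaving the convolutional and SE filters learned on the source corpus untouched.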
2. DSSCNet for Deep Subspace Clustering
Deep Subspace Clustering Network reframes subspace clustering as a deep autoencoder problem augmented by a self-expressive intermediary. The architecture proceeds as follows:
- Encoder: Maps each input $x_i$ to a latent code $z_i$ using convolutional or fully-connected layers; the codes are stacked as the columns of $Z$.
- Self-Expressive Layer: Implements $Z \mapsto ZC$ using a parameter matrix $C$, without activation or bias.
- Decoder: Reconstructs the input from the self-expressed latent representation $ZC$.
The self-expressive property is regularized via:
$Z = ZC, \quad \text{with regularization on } C \ (\ell_1 \text{ or Frobenius norm}).$
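The composite loss function combines the reconstruction error, the regularizer on $C$, and the self-expression residual. In the form given by Ji et al. (2017), with trade-off weights $\lambda_1, \lambda_2$ and $\hat{X}$ denoting the decoder output:
$\mathcal{L} = \tfrac{1}{2}\|X - \hat{X}\|_F^2 + \lambda_1 \|C\|_p + \tfrac{\lambda_2}{2}\|Z - ZC\|_F^2, \quad \text{s.t. } \operatorname{diag}(C) = 0.$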
The network is pre-trained as a standard autoencoder, then fine-tuned in full-batch mode using Adam. After convergence, an affinity matrix is built from the learned coefficients $C$ (for example, $|C| + |C|^\top$) and spectral clustering is applied to it to yield the final labels.
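To make the role of the self-expressive coefficients concrete, the sketch below substitutes a closed-form ridge solution for the learned layer: it solves min_C ‖Z − ZC‖_F² + λ‖C‖_F² directly on fixed latent codes instead of training a network. This is a deliberately simplified stand-in, not the network's optimization. Zeroing the diagonal after solving and symmetrizing as |C| + |C|ᵀ are common conventions assumed here, and the toy data and function names are hypothetical.

```python
import numpy as np


def self_expressive_coefficients(Z, lam=1e-2):
    """Ridge solution of min_C ||Z - Z C||_F^2 + lam * ||C||_F^2.

    Columns of Z are samples, so C has shape (num_samples, num_samples).
    """
    gram = Z.T @ Z
    n = gram.shape[0]
    C = np.linalg.solve(gram + lam * np.eye(n), gram)
    np.fill_diagonal(C, 0.0)  # crude way to stop a point from explaining itself
    return C


def affinity(C):
    """Symmetric, non-negative affinity matrix built from the coefficients."""
    magnitude = np.abs(C)
    return magnitude + magnitude.T


# Toy latent codes: three points in span(e1, e2), three in span(e3, e4).
rng = np.random.default_rng(0)
Z = np.zeros((4, 6))
Z[:2, :3] = rng.standard_normal((2, 3))
Z[2:, 3:] = rng.standard_normal((2, 3))

C = self_expressive_coefficients(Z)
W = affinity(C)
cross_affinity = float(W[:3, 3:].max())   # between the two subspaces
within_affinity = float(W[:3, :3].sum())  # inside the first subspace
```

Because the two toy subspaces are orthogonal, the Gram matrix is block diagonal and so is C: points are expressed only through members of their own subspace. Any standard spectral clustering routine applied to `W` would then recover the two groups.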
3. Evaluation Protocols and Experimental Outcomes
For dysarthric speech severity classification, two primary protocols are applied:
- One-Speaker-Per-Severity (OSPS): Each severity level in the test set is represented by a unique speaker.
- Leave-One-Speaker-Out (LOSO): Each fold holds out all utterances from one speaker for testing.
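The LOSO protocol amounts to grouping utterances by speaker and holding out one group per fold. The sketch below assumes each utterance is a dictionary with a `speaker` key; the record layout and speaker IDs are hypothetical and do not correspond to actual corpus identifiers.

```python
from collections import defaultdict


def leave_one_speaker_out(utterances):
    """Yield (held_out_speaker, train, test), one fold per distinct speaker."""
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)
    for speaker in sorted(by_speaker):
        test = by_speaker[speaker]
        train = [u for s, group in by_speaker.items() if s != speaker for u in group]
        yield speaker, train, test


# Hypothetical toy data: three speakers with two utterances each.
toy_utterances = [
    {"speaker": "S1", "severity": 0, "path": "s1_a.wav"},
    {"speaker": "S1", "severity": 0, "path": "s1_b.wav"},
    {"speaker": "S2", "severity": 2, "path": "s2_a.wav"},
    {"speaker": "S2", "severity": 2, "path": "s2_b.wav"},
    {"speaker": "S3", "severity": 3, "path": "s3_a.wav"},
    {"speaker": "S3", "severity": 3, "path": "s3_b.wav"},
]

folds = list(leave_one_speaker_out(toy_utterances))
```

Each fold's test set contains every utterance from exactly one speaker and the training set contains none of them, which is what makes the evaluation speaker-independent.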
Summary of observed results (accuracy, %):
| Dataset | Protocol | Baseline CNN+SE | DSSCNet (w/o FT) | DSSCNet (w/ FT) |
|---|---|---|---|---|
| TORGO | OSPS | 44.04 | 56.84 | 75.80 |
| UA-Speech | OSPS | 47.91 | 62.62 | 68.25 |
| TORGO | LOSO | 54.66 | 63.47 | 77.76 |
| UA-Speech | LOSO | 56.84 | 64.18 | 79.44 |
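The gains implied by the table can be read off with simple arithmetic. The short sketch below copies the table values verbatim, assumes they are accuracies in percent, and computes two differences per row: the gain of DSSCNet over the CNN+SE baseline without fine-tuning, and the additional gain from cross-corpus fine-tuning.

```python
# (baseline CNN+SE, DSSCNet without fine-tuning, DSSCNet with fine-tuning)
results = {
    ("TORGO", "OSPS"): (44.04, 56.84, 75.80),
    ("UA-Speech", "OSPS"): (47.91, 62.62, 68.25),
    ("TORGO", "LOSO"): (54.66, 63.47, 77.76),
    ("UA-Speech", "LOSO"): (56.84, 64.18, 79.44),
}

# Gain from the architecture alone, relative to the baseline.
architecture_gain = {
    key: round(no_ft - base, 2) for key, (base, no_ft, _) in results.items()
}

# Additional gain from cross-corpus fine-tuning.
fine_tuning_gain = {
    key: round(with_ft - no_ft, 2) for key, (_, no_ft, with_ft) in results.items()
}

largest_ft_gain = max(fine_tuning_gain, key=fine_tuning_gain.get)
```

By this arithmetic, fine-tuning adds between 5.63 and 18.96 points depending on the row, with the largest gain on TORGO under OSPS.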
Compared to prior methods such as SECNN and HuBERT-based pipelines, DSSCNet demonstrates absolute improvements of 2–25% depending on protocol and dataset, particularly after cross-corpus fine-tuning (Roy et al., 16 Sep 2025).
For deep subspace clustering, evaluation uses clustering error rates on challenging benchmarks (Extended Yale B, ORL, COIL20, COIL100). DSSCNet-L2 achieves lower error than the EDSC and AE+EDSC baselines, especially in settings with many clusters or nonlinear structure (Ji et al., 2017).
4. Ablation Analyses and Representation Insights
Ablation studies quantify how much each architectural component contributes to classification accuracy for dysarthric speech. On TORGO under OSPS, removing the SE block reduces accuracy by 4.85%, omitting the first convolutional layer reduces it by 7.21%, and removing the first residual block reduces it by 13.58%. Removing both the first and second residual blocks decreases accuracy by 10.48%. Models with SE blocks also converge faster and reach lower loss minima.
t-SNE visualization of latent embeddings demonstrates that DSSCNet with cross-corpus fine-tuning produces well-separated clusters for all four severity levels, while baseline systems show significant overlap between classes.
For deep subspace clustering, the $\ell_2$ (Frobenius-norm) regularization variant is more stable during optimization, and jointly learning the representation and the self-expressive coefficients consistently outperforms decoupled alternatives. The affinity matrix formed from the learned coefficients $C$ robustly encodes pairwise relationships for spectral clustering.
5. Architectural and Methodological Distinctions
DSSCNet for dysarthric speech classification integrates a spectrotemporal CNN backbone, channel-wise attention (SE), and residual connections, yielding robust, transferable representations for medical audio analysis. In contrast, DSSCNet for subspace clustering is autoencoder-based, with a self-expressive layer that enforces structural constraints on the latent space. The two models are not interchangeable; their architectures, objectives, and evaluation methodologies are defined by domain-specific demands.
6. Significance and Future Directions
DSSCNet’s design for dysarthric speech achieves state-of-the-art speaker-independent generalization by harmonizing low-level acoustic cue extraction, channel re-weighting, and depth. Cross-corpus transfer learning further boosts performance, suggesting robust adaptability to clinical data heterogeneity (Roy et al., 16 Sep 2025).
In unsupervised subspace clustering, DSSCNet demonstrates that embedding self-expressiveness into deep nonlinear models enables cutting-edge clustering performance, particularly for complex, high-dimensional, and nonlinear manifolds (Ji et al., 2017).
A plausible implication is that architectural modularity (combining convolution, attention, residual, or self-expressive mechanisms as the application dictates) can be used to build domain-specialized deep networks. The two systems are complementary examples: the name DSSCNet covers a supervised model for medical speech categorization and an unsupervised model for representation learning.