
Disentangling Dual-Encoder MAE

Updated 30 June 2025
  • The paper introduces a dual-encoder MAE that segregates disease-related signals from domain-specific noise using a shared decoder and reconstruction loss.
  • It employs Siamese and mutual-information minimization losses to enforce statistical independence between the two encoder outputs.
  • Empirical results in respiratory sound classification show enhanced sensitivity and robust performance despite domain shifts.

Disentangling Dual-Encoder MAE (DDE-MAE) refers to a class of architectures in which a masked autoencoder (MAE) is equipped with two independent encoder networks, each specializing in representing distinct factors of variation—typically by separating target (task-relevant) information from nuisance (task-irrelevant or domain-dependent) information. The DDE-MAE approach is situated at the intersection of feature disentanglement and self-supervised masked autoencoding, aiming to address domain shift, robustness, and interpretability in deep learning representations, with notable applications in fields such as respiratory sound classification.

1. Architectural Design: Dual-Encoder Masked Autoencoding

DDE-MAE extends the conventional masked autoencoder framework by introducing two structurally independent encoders and a shared decoder:

  • Disease-Related Encoder ($E_r$): Processes the raw input (e.g., an audio spectrogram), extracting features most pertinent to the target clinical condition.
  • Disease-Irrelevant Encoder ($E_i$): Processes both the original input and a time-shuffled variant, which is constructed to destroy temporal dependencies and thus obscure disease-relevant cues while retaining device, environment, or demographic characteristics.

The outputs of both encoders are fused and provided to a decoder network, which reconstructs the original (or masked) input. During training, reconstruction loss is employed alongside specialized loss functions to enforce the desired factorization of information.

Diagrammatic overview (as in the referenced work):

Input X ───────────────────────────> [E_r] ──┐
   │                                         ├──> [Decoder] ──> Reconstruction
   ├───────────────────────────────> [E_i] ──┘
   └──> time shuffling ──> X̃ ──────> [E_i]      (Siamese loss between E_i(X) and E_i(X̃))

This arrangement ensures that each encoder is incentivized to specialize in complementary, minimally overlapping information.
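
As a concrete illustration, the following is a minimal PyTorch-style sketch of the dual-encoder arrangement. The linear encoders, layer sizes, concatenation-based fusion, and the choice to feed the decoder with $E_i$'s view of the unshuffled input are assumptions made for brevity; the patch masking of a full MAE is omitted.

```python
import torch
import torch.nn as nn

class DualEncoderMAE(nn.Module):
    """Illustrative dual-encoder masked autoencoder (a sketch, not the authors' code)."""

    def __init__(self, input_dim=128, latent_dim=256):
        super().__init__()
        # Disease-related encoder E_r and disease-irrelevant encoder E_i
        self.encoder_r = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.GELU())
        self.encoder_i = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.GELU())
        # Shared decoder reconstructs the input from the fused latent codes
        self.decoder = nn.Linear(2 * latent_dim, input_dim)

    def forward(self, x, x_shuffled):
        z_r = self.encoder_r(x)                 # disease-related features of X
        z_i = self.encoder_i(x)                 # disease-irrelevant features of X
        z_i_tilde = self.encoder_i(x_shuffled)  # same encoder applied to X~
        recon = self.decoder(torch.cat([z_r, z_i], dim=-1))
        return recon, z_r, z_i, z_i_tilde
```

The four returned tensors feed the reconstruction, Siamese, and mutual-information losses described in Sections 2 and 5.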

2. Mechanisms of Feature Disentanglement

Feature disentanglement aims to ensure that learned representations capture statistically and semantically independent factors of variation. DDE-MAE leverages several mechanisms to achieve explicit disentanglement:

  • Siamese Loss on the Disease-Irrelevant Encoder ($E_i$): A contrastive loss is applied to the outputs of $E_i$ when it is fed the original and time-shuffled versions of the same input, enforcing invariance to temporal structure:

$$\mathcal{L}_{\text{siamese}} = \frac{1}{N} \sum_{i=1}^{N} \max\left(\lVert x_i - \tilde{x}_i \rVert_2 - \text{margin},\ 0\right)$$

This ensures that $E_i$ learns to ignore disease-relevant cues carried by temporal dependencies.

  • Mutual Information Minimization via vCLUB: To encourage statistical independence between the outputs of $E_r$ (disease-related) and $E_i$ (disease-irrelevant), DDE-MAE estimates and minimizes the mutual information between the two feature spaces using the vCLUB upper bound (see the estimator sketch below):

$$I(Z_R, Z_I) \leq \mathbb{E}_{p(Z_R, Z_I)}\left[\log q_\theta(Z_I \mid Z_R)\right] - \mathbb{E}_{p(Z_R)}\,\mathbb{E}_{p(Z_I)}\left[\log q_\theta(Z_I \mid Z_R)\right]$$

  • Masked Autoencoding Loss: Ensures that both encoders’ outputs are necessary and sufficient to reconstruct the observed input, enforcing complementary specialization.

This combination structurally and statistically encourages separation of target information (e.g., disease cues) from confounding factors (device identity, subject, environment), reducing the risk of learned models overfitting to domain artifacts.
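
A minimal sketch of a vCLUB-style estimator for the bound above is shown below; the diagonal-Gaussian form of $q_\theta$ and the network sizes are assumptions rather than the authors' exact implementation. In practice the estimator is trained to maximize the log-likelihood of matched $(Z_R, Z_I)$ pairs, while the encoders are trained to minimize the resulting bound.

```python
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """Sketch of a vCLUB-style mutual-information upper bound between Z_R and Z_I."""

    def __init__(self, dim_r=256, dim_i=256, hidden=256):
        super().__init__()
        # Variational approximation q_theta(z_i | z_r) as a diagonal Gaussian
        self.mu = nn.Sequential(nn.Linear(dim_r, hidden), nn.ReLU(), nn.Linear(hidden, dim_i))
        self.logvar = nn.Sequential(nn.Linear(dim_r, hidden), nn.ReLU(), nn.Linear(hidden, dim_i))

    def log_likelihood(self, z_r, z_i):
        # Gaussian log-density of z_i under q_theta(. | z_r), constants dropped
        mu, logvar = self.mu(z_r), self.logvar(z_r)
        return (-(z_i - mu) ** 2 / logvar.exp() - logvar).sum(dim=-1)

    def mi_upper_bound(self, z_r, z_i):
        # Positive term: matched pairs; negative term: z_i shuffled across the batch
        positive = self.log_likelihood(z_r, z_i)
        negative = self.log_likelihood(z_r, z_i[torch.randperm(z_i.size(0))])
        return (positive - negative).mean()
```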

3. Empirical Performance in Respiratory Sound Classification

DDE-MAE has been evaluated on the ICBHI respiratory cycles dataset, which comprises 6,898 cycles recorded using multiple devices and patient populations, a setting rife with domain shift and diversity.

Evaluation metrics:

  • Specificity ($S_p$): Proportion of normal cycles correctly identified.
  • Sensitivity ($S_e$): Proportion of abnormal cycles (crackle, wheeze, or both) correctly identified.
  • Average Score (AS): Arithmetic mean of specificity and sensitivity.

Quantitative results demonstrate that DDE-MAE achieves the highest sensitivity (53.69%) among the compared models, a critical metric in clinical diagnostic settings, while its average score remains close to that of contrastive baselines whose specificity is higher.

| Model | Specificity (%) | Sensitivity (%) | Avg. Score (%) |
|-------|-----------------|-----------------|----------------|
| DDE-MAE | 69.32 | 53.69 | 61.50 |
| SG-SCL | 79.87 | 43.55 | 61.71 |
| AST+patch-mix CL | 81.66 | 43.01 | 62.37 |
| ARSC-Net | 67.13 | 46.38 | 56.76 |

Ablation studies confirm the necessity of both Siamese and mutual information losses: removing either component degrades both sensitivity and specificity.
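
For reference, a minimal sketch of how these cycle-level metrics can be computed is given below; the label encoding (0 = normal, 1–3 = crackle / wheeze / both) is an assumption about the evaluation setup.

```python
import numpy as np

def icbhi_scores(y_true, y_pred):
    """Specificity, sensitivity, and average score for four-class ICBHI labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    normal = y_true == 0
    abnormal = ~normal
    sp = float((y_pred[normal] == 0).mean())                   # correct normal cycles
    se = float((y_pred[abnormal] == y_true[abnormal]).mean())  # correct abnormal cycles
    return sp, se, (sp + se) / 2                               # average score

# e.g. icbhi_scores([0, 1, 2, 0], [0, 1, 3, 1]) -> (0.5, 0.5, 0.5)
```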

4. Addressing Domain Mismatch: Theory and Practice

Domain mismatch—arising from divergent devices, patients, or environments—poses a significant barrier to generalization in medical sound classification. DDE-MAE addresses this challenge through:

  • Self-supervised disentanglement: No domain labels are required; separation is enforced via architectural and loss-design choices.
  • Dual encoding of relevant and irrelevant factors: By isolating the two encoders' outputs and minimizing their interaction, disease cues are prevented from mingling with spurious domain-specific artifacts.
  • Invariance regularization: Siamese loss enforces that the disease-irrelevant encoder is oblivious to temporal (and thus, disease-specific) structure, aligning its focus on slowly varying, device- or identity-associated patterns.

This approach obviates the need for curated domain metadata, making DDE-MAE attractive for diverse real-world settings where domain attributes are unavailable or unreliable.
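
The time-shuffling operation used to construct the disease-irrelevant view $\tilde{X}$ can be sketched as follows; the number of segments and the segment-level permutation strategy are illustrative assumptions.

```python
import torch

def time_shuffle(spec, num_segments=8):
    """Build a time-shuffled view X~ of a batch of spectrograms.

    spec: tensor of shape (batch, freq_bins, time_frames)
    """
    seg_len = spec.shape[-1] // num_segments
    usable = seg_len * num_segments
    # Split the time axis into equal segments and permute their order
    segments = torch.split(spec[..., :usable], seg_len, dim=-1)
    perm = torch.randperm(num_segments).tolist()
    shuffled = torch.cat([segments[i] for i in perm], dim=-1)
    # Frames beyond the last full segment are appended unchanged
    return torch.cat([shuffled, spec[..., usable:]], dim=-1)
```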

5. Loss Functions and Mathematical Formulation

The DDE-MAE training objective integrates three principal loss terms:

  • Reconstruction loss:

$$\mathcal{L}_{\text{recon}} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2$$

  • Siamese loss (disease-irrelevant invariance):

$$\mathcal{L}_{\text{siamese}} = \frac{1}{N} \sum_{i=1}^{N} \max\left(\lVert x_i - \tilde{x}_i \rVert_2 - \text{margin},\ 0\right)$$

  • Mutual information minimization (vCLUB):

$$I(Z_R, Z_I) \leq \mathbb{E}_{p(Z_R, Z_I)}\left[\log q_\theta(Z_I \mid Z_R)\right] - \mathbb{E}_{p(Z_R)}\,\mathbb{E}_{p(Z_I)}\left[\log q_\theta(Z_I \mid Z_R)\right]$$

The composite loss is

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \alpha_1 \mathcal{L}_{\text{siamese}} + \alpha_2 \mathcal{L}_{\text{MI}},$$

with $\alpha_1, \alpha_2$ balancing the contributions of the disentanglement terms.
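
A sketch of this composite objective, assuming the dual-encoder module and CLUB-style estimator sketched in earlier sections and placeholder values for $\alpha_1$, $\alpha_2$, and the margin, might look like:

```python
import torch
import torch.nn.functional as F

def dde_mae_loss(x, recon, z_r, z_i, z_i_tilde, mi_estimator,
                 alpha1=1.0, alpha2=0.1, margin=1.0):
    """Composite DDE-MAE-style objective: reconstruction + Siamese + MI penalty.
    Loss weights and margin are illustrative placeholders."""
    l_recon = F.mse_loss(recon, x)                        # masked-autoencoding term
    # Hinge-style Siamese loss pulling E_i(X) and E_i(X~) together
    dist = (z_i - z_i_tilde).norm(dim=-1)
    l_siamese = torch.clamp(dist - margin, min=0).mean()
    # vCLUB upper-bound estimate of I(Z_R; Z_I), minimized w.r.t. the encoders
    l_mi = mi_estimator.mi_upper_bound(z_r, z_i)
    return l_recon + alpha1 * l_siamese + alpha2 * l_mi
```

In typical CLUB-style training, the variational estimator's own log-likelihood objective is optimized in an alternating step, separate from the encoder/decoder update.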

6. Limitations and Prospects for Future Research

While DDE-MAE achieves pronounced improvements in disentanglement and domain robustness, several limitations and research directions are identified:

  • Imperfect separation: Qualitative analyses reveal residual leakage between disease-related and domain-related features, suggesting that further innovation in loss formulation or adversarial constraints could yield more complete disentanglement.
  • Hierarchical or subtle domain shifts: Current formulation targets gross domain factors; finer-grained or hierarchical disentanglement remains a challenge.
  • Expansion to multimodal settings: Incorporating auxiliary data (e.g., demographics, device type) in a weakly supervised framework may further boost robustness.
  • Applicability beyond respiratory audio: The DDE-MAE framework is general and could plausibly be adopted in other clinical or sensor-based domains facing analogous domain-mismatch challenges.
  • Robustness in real-world deployment: Broader evaluation across previously unseen devices or environments is necessary to fully ascertain generalization performance.

7. Comparative Context and Theoretical Foundations

DDE-MAE builds on earlier disentanglement frameworks:

  • Dual swap disentangling (1805.10583): Semi-supervised dual autoencoders using encoding-swap-decoding and cycle-swap processes to enforce dimension-wise modularity.
  • Decorrelation regularization (2001.08572): Distance covariance loss for factor independence in dual-encoder settings, offering precise control of disentanglement.
  • Density estimation techniques (2302.04362): Conditional density estimation for robust disentanglement in high-dimensional latent spaces, potentially informing DDE-MAE regularization strategies.
  • Group-theoretic and symmetry-based disentanglement (2202.09926): Deterministic, non-probabilistic autoencoders leveraging symmetry transformations, pointing toward non-stochastic regularizer-free variants.

The DDE-MAE approach incorporates several of these principles, including architectural separation, statistical independence, and invariance constraints, underpinned by theoretical insights from masked autoencoding (2202.03670).


Summary Table: DDE-MAE Model Highlights

| Aspect | Description |
|--------|-------------|
| Architecture | Dual encoders ($E_r$, $E_i$) plus a shared decoder |
| Disentanglement | Siamese loss (invariance), vCLUB MI loss (independence) |
| Performance | Highest sensitivity (53.69%) on ICBHI; competitive overall |
| Domain mismatch | Addressed without need for explicit domain labels |
| Loss function | Combined reconstruction, Siamese, and MI-minimization losses |
| Future directions | More complete separation, richer domains, multimodal/metadata use |

DDE-MAE represents an empirically validated methodology for robust, interpretable representation learning under domain shift, achieving competitive performance in clinically relevant sound classification without requiring domain annotations or data augmentation tricks.
