Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Published 5 Apr 2026 in cs.MM, cs.AI, cs.CV, and cs.SD | (2604.04229v1)

Abstract: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper presents HSC-MAE, a dual-path teacher-student model that enforces hierarchical semantic consistency for robust cross-modal retrieval.
It combines global DCCA alignment, local soft top-k InfoNCE loss, and sample-level masking to create modality-invariant embeddings.
Experiments on AVE and VEGAS show up to 15.72% mAP improvement over baselines, confirming the effectiveness of the proposed framework.

Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Introduction and Motivation

Unsupervised learning of cross-modal representations remains a significant challenge due to the weak, noisy, and ambiguous association between modalities in real-world corpora. The "Hierarchical Semantic Correlation-Aware Masked Autoencoder (HSC-MAE)" addresses unsupervised audio-visual representation learning from pre-extracted, paired—but unlabeled—audio-visual features. It directly confronts major limitations of pre-existing approaches such as strict single-positive assumptions, confirmation bias, and lack of mechanisms for robust intra-modal decoding, thereby enabling more reliable multimodal semantic alignment under weak supervision.

Methodology

HSC-MAE introduces a dual-path teacher-student framework that enforces semantic consistency hierarchically at three levels: global (canonical-geometry), local (neighborhood-semantics), and instance (conditional-sufficiency).

Global-level alignment: An EMA-updated teacher pathway operates in a clean (unmasked) regime and employs Deep Canonical Correlation Analysis (DCCA) to enforce that audio and visual embeddings share a modality-invariant low-dimensional subspace, providing a global geometric scaffold.
Local-level structure: The student pathway is regularized via a soft top- $k$ InfoNCE loss, where local semantic neighborhoods are mined and weighted with affinities provided by the stable teacher. This approach relaxes brittle single-positive matching in favor of more realistic multi-positive neighborhoods.
Sample-level robustness: The student employs sample-level feature masking and autoencoding, aiming for robustness to incomplete or noisy descriptors and ensuring conditional sufficiency of representations.
Multi-task optimization: All objectives are harmonized via learnable uncertainty-based multi-task weights, with an optional consistency loss distilling teacher geometry into the student.

This architecture enables decoupling of correlation and reconstruction objectives while strategically coupling them via cross-attention, a shared encoder, and soft-positive mining.

Figure 1: Schematic of HSC-MAE's dual-path architecture with shared encoders, cross-attention fusion, sample-level masking, EMA teacher, and hierarchical objective coordination.

Experimental Results

Datasets and Setup

HSC-MAE is evaluated on AVE and VEGAS datasets using pre-extracted audio (VGGish, 128-D) and visual (InceptionV3, 1024-D) descriptors. The model is trained entirely unsupervised, with cross-modal retrieval performance (UCMR: audio-to-visual, visual-to-audio) assessed via mean Average Precision (mAP).

Main Quantitative Results

HSC-MAE delivers significant improvements over all considered baselines:

On AVE, averaged mAP increases from 0.6165 (CAV-MAE) to 0.7737 (+15.72%) and outperforms InfoNCE, triplet, and DUMCH by a clear margin.
On VEGAS, it improves average mAP from 0.7535 to 0.8026 over CAV-MAE (+4.91%), and achieves higher scores than DCCA and contrastive learning variants.

The HSC-MAE model achieves the strongest reported unsupervised cross-modal retrieval results across both AVE and VEGAS.

Ablation Analysis

Ablation studies systematically demonstrate that:

Removal of masked reconstruction or soft top- $k$ InfoNCE substantially degrades retrieval accuracy, with up to 22% mAP drops.
Disabling the global CCA objective leads to large performance degradation, indicating the necessity of global structure.
Each loss contributes distinct, complementary benefits; their combination yields the most robust embedding geometry.
Figure 2: Left: Decomposition of training losses for different ablated variants. Right: Test mAP over training epochs, highlighting substantial performance drops with component removal.

Mask Ratio Sensitivity

Performance is contingent on an optimal feature mask ratio. Moderate masking (e.g., 0.2–0.3) yields maximal mAP; overly aggressive masking (≥0.5) impairs semantic alignment, while too little masking under-regularizes and weakens robustness.

Figure 3: Effect of sample-level mask ratio on UCMR task for AVE and VEGAS; optimal retrieval at moderate masking, with dataset-dependent gaps between retrieval directions.

Qualitative Retrieval

HSC-MAE demonstrates retrieval alignment that is robust to subtle cross-modal ambiguities. Querying with audio or visual features of a "truck" consistently retrieves semantically congruent top results; mismatches primarily reflect inter-class confusion (e.g., truck vs. bus) rather than modality-level noise or misalignment.

Figure 4: Qualitative top-10 retrievals for audio and visual queries (AVE), showing robust cross-modal alignment despite ambiguous categories.

Implications and Future Directions

HSC-MAE provides a general algorithmic framework for scalable, label-efficient learning in multimodal settings where only compact feature descriptors are available and paired supervision is weak or ambiguous. The approach is immediately applicable to embodied agents and large-scale multimedia retrieval, with strong implications for:

Robust multimodal embedding under partial observability and noise
Reductions in confirmation bias and improved mining of positives in multi-event corpora
Extension to arbitrary modalities and scalability to online/self-adaptive systems

Further directions include scaling to raw sensor data, causal discovery of multimodal structure, and integration with end-to-end video/sequence models.

Conclusion

HSC-MAE introduces hierarchical semantic correlation-aware constraints within a dual-path masked autoencoder, leveraging global DCCA alignment, local soft-neighborhood InfoNCE, sample-level masked autoencoding, and teacher-guided distillation. Strong empirical evidence attests to substantial mAP gains over diverse unsupervised baselines, with ablations confirming non-redundant contributions of each architectural innovation. The work consolidates effective solutions for unsupervised audio-visual representation, providing a robust template for future work in scalable multimodal learning.