Hierarchical Gated Fusion Decoder (HiGate)
- Hierarchical Gated Fusion Decoder (HiGate) is a multimodal mechanism that progressively injects gated information between modalities for improved active speaker detection.
- It employs layer-wise fusion at designated Transformer depths, enabling fine-grained, context-aware integration of audio and visual features.
- Empirical results show significant performance gains on ASD benchmarks with minimal computational overhead, validating its efficiency and effectiveness.
The Hierarchical Gated Fusion Decoder (HiGate) is an architectural component for multimodal learning that performs progressive, layer-wise gated information injection between modalities—specifically designed to address limitations of late fusion in tasks such as Active Speaker Detection (ASD). HiGate was introduced within the GateFusion model, which achieved state-of-the-art results on several ASD benchmarks by enabling fine-grained cross-modal interactions between strong pretrained audio and visual encoders (Wang et al., 17 Dec 2025).
1. Architectural Role and Integration
HiGate sits at the interface between two strong pretrained unimodal encoders, AV-HuBERT (video) and Whisper (audio), and a lightweight classifier. Each encoder is a deep stack of Transformer or hybrid ResNet+Transformer layers, producing hidden states $h_m^l$ at layer $l$ for modality $m \in \{a, v\}$. Hidden states from the final encoder layer $L$ are projected to a shared feature width $d$ via a linear transformation:
$$f_m = \phi_m(h_m^{L}) \in \mathbb{R}^{B \times T_m \times d},$$
where $B$ is the batch size and $T_m$ is the sequence length for modality $m$.
HiGate does not simply fuse $f_v$ and $f_a$ at the end (“late fusion”); instead, it refines the representation of a “primary” modality by repeatedly injecting gated information from several depths of the “context” modality’s encoder. This process is conducted symmetrically in both directions, generating enriched representations $\tilde{f}_v$ and $\tilde{f}_a$, which are temporally aligned and summed before classification.
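The following minimal PyTorch sketch illustrates this interface: final-layer hidden states from each encoder are projected to a shared width $d$, each branch is enriched by HiGate (see the pseudocode in Section 3), and the two streams are aligned and summed before a lightweight classifier. The tensor shapes, encoder hidden widths, interpolation-based alignment, and two-class head are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T_v, T_a, d = 2, 50, 100, 256          # hypothetical batch size, sequence lengths, shared width
h_v_final = torch.randn(B, T_v, 1024)      # AV-HuBERT final-layer hidden states (width assumed)
h_a_final = torch.randn(B, T_a, 1280)      # Whisper final-layer hidden states (width assumed)

proj_v = nn.Linear(1024, d)                # phi_v: project video features to shared width d
proj_a = nn.Linear(1280, d)                # phi_a: project audio features to shared width d
f_v, f_a = proj_v(h_v_final), proj_a(h_a_final)

# ... HiGate enriches f_v with audio context and f_a with video context (Section 3) ...
f_v_tilde, f_a_tilde = f_v, f_a            # placeholders for the enriched streams

# Temporally align the audio stream to the video length and sum before classification.
f_a_aligned = F.interpolate(f_a_tilde.transpose(1, 2), size=T_v).transpose(1, 2)
fused = f_v_tilde + f_a_aligned
classifier = nn.Linear(d, 2)               # lightweight head (binary speaking/non-speaking assumed)
logits = classifier(fused)                 # (B, T_v, 2) per-frame scores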
2. Gated Fusion Mechanism
For a given fusion step at context-encoder layer $l$, HiGate operates as follows:
- The hidden state $h_c^{l}$ from the context modality is projected to width $d$ and temporally aligned to the primary modality’s length $T_p$, producing $\tilde{h}_c = I(\phi(h_c^{l}), T_p)$, where $I(\cdot, T_p)$ denotes pooling or interpolation.
- The primary feature $\tilde{f}_p$ and aligned context feature $\tilde{h}_c$ are concatenated and passed through a linear layer and sigmoid activation to generate a bimodally conditioned gate: $g^{l} = \sigma\!\left(W_g^{l}\,[\tilde{f}_p;\tilde{h}_c] + b_g^{l}\right)$.
- The context feature is injected, modulated by the gate, added to the primary feature, and normalized with LayerNorm: $\tilde{f}_p \leftarrow \mathrm{LN}\!\left(\tilde{f}_p + g^{l} \odot \tilde{h}_c\right)$.
This process allows each fusion step to adaptively blend multimodal information based on both modalities’ current representations, promoting fine-grained, context-aware enrichment at each selected layer.
3. Progressive Multi-Depth Fusion Strategy
GateFusion applies HiGate at four designated encoder depths, $\mathcal{L}_{\text{fuse}} = \{1, 4, 7, 10\}$, out of the first 12 Transformer layers. This coarse-to-fine schedule progressively incorporates low-, mid-, and high-level features from the context modality. The full fusion procedure, with video as primary and audio as context, is as follows:
Input:  f_v, {h_a^l}_{l=1..L}, L_fuse = {1, 4, 7, 10}
Initialize: tilde_f_v ← f_v
for l in L_fuse do
    h̃_a ← I(φ(h_a^l), T_v)                        # project and align audio context to video length
    g^l ← σ( W_g^l [tilde_f_v ; h̃_a] + b_g^l )     # bimodally conditioned gate
    tilde_f_v ← LN( tilde_f_v + g^l ⊙ h̃_a )        # gated injection, residual, LayerNorm
end for
Output: enriched tilde_f_v
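A minimal PyTorch sketch of this procedure (video as primary, audio as context) is given below. It follows the pseudocode above; the module layout, hidden widths, and the use of linear interpolation for temporal alignment are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HiGateFusion(nn.Module):
    """Sketch of progressive gated fusion: video primary, audio context."""
    def __init__(self, d=256, d_ctx=1280, fuse_layers=(1, 4, 7, 10)):
        super().__init__()
        self.fuse_layers = fuse_layers
        # One context projection (phi), gate (W_g, b_g), and LayerNorm per fused depth.
        self.proj = nn.ModuleDict({str(l): nn.Linear(d_ctx, d) for l in fuse_layers})
        self.gate = nn.ModuleDict({str(l): nn.Linear(2 * d, d) for l in fuse_layers})
        self.norm = nn.ModuleDict({str(l): nn.LayerNorm(d) for l in fuse_layers})

    def forward(self, f_v, audio_hidden_states):
        # f_v: (B, T_v, d) projected video features; audio_hidden_states: {layer: (B, T_a, d_ctx)}
        T_v = f_v.size(1)
        f_tilde = f_v
        for l in self.fuse_layers:
            h_a = self.proj[str(l)](audio_hidden_states[l])                  # project to width d
            h_a = F.interpolate(h_a.transpose(1, 2), size=T_v,
                                mode="linear", align_corners=False).transpose(1, 2)  # align to T_v
            g = torch.sigmoid(self.gate[str(l)](torch.cat([f_tilde, h_a], dim=-1)))  # gate
            f_tilde = self.norm[str(l)](f_tilde + g * h_a)                   # gated injection + LN
        return f_tilde

# Example usage with hypothetical shapes.
B, T_v, T_a, d, d_ctx = 2, 50, 100, 256, 1280
f_v = torch.randn(B, T_v, d)
audio_states = {l: torch.randn(B, T_a, d_ctx) for l in (1, 4, 7, 10)}
enriched_f_v = HiGateFusion(d, d_ctx)(f_v, audio_states)   # (B, T_v, d)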
4. Auxiliary Objectives and Loss Interaction
To improve the robustness and consistency of unimodal branches, HiGate incorporates two auxiliary objectives in addition to the main multimodal cross-entropy loss.
- Masked Alignment Loss (MAL): For each positive frame (speaker present), MAL computes KL divergences between the unimodal outputs and the joint multimodal prediction, masking the terms so that only active speech intervals contribute:
$$\mathcal{L}_{\text{MAL}} = \frac{1}{|\mathcal{P}|} \sum_{t \in \mathcal{P}} \left( D^{a}_t + D^{v}_t \right),$$
where $t \in \mathcal{P}$ indexes positive frames and $D^{a}_t$ (resp. $D^{v}_t$) measures the divergence between the joint output and the audio (resp. video) prediction.
- Over-Positive Penalty (OPP): To suppress spurious video-only activations when no speech is present, OPP penalizes the video branch’s confidence on silent frames:
$$\mathcal{L}_{\text{OPP}} = \frac{1}{|\mathcal{N}|} \sum_{t \in \mathcal{N}} p^{v}_t, \qquad \mathcal{N} = \{\, t : y_t = 0 \,\},$$
where $p^{v}_t$ is the video branch’s positive-class probability on frame $t$ and $y_t$ is the ground-truth label.
- Total loss: The joint objective combines the main cross-entropy loss with the two auxiliary terms,
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_{\text{MAL}}\,\mathcal{L}_{\text{MAL}} + \lambda_{\text{OPP}}\,\mathcal{L}_{\text{OPP}},$$
with scalar weights $\lambda_{\text{MAL}}$ and $\lambda_{\text{OPP}}$ balancing the auxiliary terms.
These auxiliary terms encourage unimodal and multimodal congruence during speaking intervals and penalize silent-frame false positives from visual signals.
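A hedged PyTorch sketch of these auxiliary terms follows. The KL direction, per-frame averaging, and weight values are assumptions for illustration; only the overall structure (KL alignment on positive frames, a penalty on the video branch's probabilities over silent frames) is taken from the description above.

import torch
import torch.nn.functional as F

def auxiliary_losses(logits_av, logits_a, logits_v, labels,
                     lambda_mal=1.0, lambda_opp=1.0):   # weight values are hypothetical
    """Sketch of MAL and OPP. logits_*: (B, T, 2); labels: (B, T) with 1 = speaking."""
    pos = labels == 1                                    # mask of positive (speaking) frames
    neg = labels == 0                                    # mask of silent frames

    p_av = F.log_softmax(logits_av, dim=-1)              # joint prediction (log-probs)
    # Divergence of the audio and video branches from the joint output (direction assumed),
    # averaged over positive frames only.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1), p_av, log_target=True, reduction="none").sum(-1)
    kl_v = F.kl_div(F.log_softmax(logits_v, dim=-1), p_av, log_target=True, reduction="none").sum(-1)
    mal = ((kl_a + kl_v) * pos).sum() / pos.sum().clamp(min=1)

    # OPP: mean positive-class probability of the video branch on silent frames.
    p_v_pos = F.softmax(logits_v, dim=-1)[..., 1]
    opp = (p_v_pos * neg).sum() / neg.sum().clamp(min=1)

    return lambda_mal * mal + lambda_opp * opp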
5. Model Capacity, Training, and Computational Footprint
HiGate introduces additional parameters primarily via the per-depth gating modules, each a linear map from the concatenated $2d$-dimensional input to a $d$-dimensional gate, for a total of approximately $13$ million parameters over the baseline. The computational overhead is moderate:
- VRAM usage increases only modestly compared to a simple late-fusion scheme.
- Inference speed drops on the order of $10\%$ relative to late fusion, but remains faster than attention-heavy decoders such as LoCoNet.
Training employs the AdamW optimizer with separate learning rates for the encoders and the decoder, a batch size of $1,500$ frames, and a truncated Transformer layer regime. Learning rates are decayed by a factor of $0.95$ every $3,000$ steps, for a total of $30,000$ steps.
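A minimal sketch of this optimization setup, assuming placeholder learning-rate values (the report uses separate encoder and decoder rates whose exact values are not reproduced here), is:

import torch

# encoder_params / decoder_params would come from the actual model; shown here as placeholders.
encoder_params = [torch.nn.Parameter(torch.zeros(1))]
decoder_params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.AdamW([
    {"params": encoder_params, "lr": 1e-5},   # hypothetical encoder learning rate
    {"params": decoder_params, "lr": 1e-4},   # hypothetical decoder learning rate
])
# Multiply all learning rates by 0.95 every 3,000 steps, over 30,000 total steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.95)

for step in range(30_000):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()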
6. Empirical Performance and Context
HiGate demonstrated substantial empirical improvements for ASD, with mAP increases on Ego4D-ASD, UniTalk, and WASD. Relative to late fusion, it preserved fine-grained information and enabled multi-level, context-aware enrichment, as evidenced by notable mAP gains, particularly on Ego4D-ASD. The model also showed robust generalization across domains.
A plausible implication is that HiGate’s progressive, layerwise cross-modal gating paradigm can be adapted for other multimodal tasks where late fusion bottlenecks fine-grained interaction and temporal alignment. The auxiliary objective integration further suggests avenues for principled unimodal-multimodal consistency regularization in related applications (Wang et al., 17 Dec 2025).