Mask-Guided Visual Encoder (MVE)
- MVE is presented as a design principle that integrates masks into visual feature extraction so that encoding focuses on anatomically, semantically, or user-specified regions.
- MVE architectures vary across domains, employing input-level mask channels, dual-branch pathways, or mask-guided supervision in tasks like medical imaging and audiovisual generation.
- Empirical results show that mask guidance significantly improves performance metrics in classification, segmentation, and registration across diverse applications.
Mask-guided Visual Encoder (MVE) denotes a family of architectures in which a binary or soft mask is incorporated into visual feature extraction so that representation learning is constrained by anatomically, semantically, or user-specified regions rather than only global image content. In recent work, the term has been used for at least five distinct but related constructions: a mask-channel visual pathway in fetal ultrasound vision-language pretraining, a two-pass longitudinal encoder with a shared change mask for medical visual question answering, a dual-branch object-aware video encoder for interactive video-to-audio generation, a multi-scale region-style extractor for talking-face synthesis, and, in MrRegNet, a mask-guided supervision path built around a multi-resolution encoder-decoder rather than a separate detailed MVE subnetwork (Su et al., 28 Jun 2026, Wu et al., 3 Jun 2026, Liang et al., 7 Jul 2025, Xiong et al., 2024, Li et al., 2024).
1. Taxonomy of the term
The term MVE does not refer to a single canonical architecture. Instead, it designates a recurrent design principle: masks are used to bias visual encoding toward local structures that are decisive for downstream prediction, alignment, generation, or registration. The concrete implementation differs substantially across domains, including fetal ultrasound, chest X-ray longitudinal reasoning, video-to-audio generation, talking-face synthesis, and deformable medical image registration (Su et al., 28 Jun 2026, Wu et al., 3 Jun 2026, Liang et al., 7 Jul 2025, Xiong et al., 2024, Li et al., 2024).
| Work | Mask role | Encoder form |
|---|---|---|
| SonoCLIP (Su et al., 28 Jun 2026) | mask-channel visual prompt | parallel convolutional stems with additive fusion |
| Longitudinal Medical VQA (Wu et al., 3 Jun 2026) | shared change mask for two-pass encoding | registered pair, mask generation, masked re-encoding |
| Hear-Your-Click (Liang et al., 7 Jul 2025) | object-centric masking for audiovisual alignment | dual-branch SlowOnly encoder |
| SegTalker (Xiong et al., 2024) | region-wise style extraction | multi-scale encoder with mask-guided pooling |
| MrRegNet (Li et al., 2024) | multi-scale ROI supervision | multi-resolution encoder-decoder with mask-guided loss |
This dispersion of usage is important. In some papers, the mask is an input-channel prompt injected into the visual stem; in others, it is a generated latent selector reused across multiple images; in generative models, it may serve as a semantic partition for region-specific style coding; and in registration, it may operate only through supervision rather than architectural feature gating (Su et al., 28 Jun 2026, Wu et al., 3 Jun 2026, Xiong et al., 2024, Li et al., 2024).
2. Architectural integration patterns
A direct input-level integration strategy appears in SonoCLIP. The model takes a pseudo-color ultrasound image and a binary anatomical mask, processes them through two parallel convolutional stems, and fuses the two embeddings by addition followed by a depthwise convolution.
The mask branch is initialized with zero weights, so that its contribution to the fused embedding is initially zero and pretrained CLIP behavior is preserved at initialization. The mask enters at the very early convolutional stem of a CLIP ViT-L/14@336px visual backbone, and the paper explicitly states that the key architectural change is the mask-channel visual pathway itself rather than a large redesign of the transformer or a dedicated region-token head (Su et al., 28 Jun 2026).
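The zero-initialization argument can be illustrated with a minimal sketch. It assumes each stem acts as a per-patch linear projection and omits the depthwise convolution that follows the addition; the helper names, toy sizes, and weight values are all hypothetical and are not taken from SonoCLIP.

```python
def matvec(weights, vector):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def fuse_patch(image_patch, mask_patch, w_image, w_mask):
    """Additive fusion of the image-stem and mask-stem embeddings for one patch."""
    image_embedding = matvec(w_image, image_patch)
    mask_embedding = matvec(w_mask, mask_patch)
    return [a + b for a, b in zip(image_embedding, mask_embedding)]


# Hypothetical toy sizes: 3-value image patch, 2-value mask patch, 2-dim embedding.
w_image = [[0.5, -1.0, 2.0], [1.0, 0.0, 0.25]]  # stands in for pretrained stem weights
w_mask_init = [[0.0, 0.0], [0.0, 0.0]]          # zero-initialized mask stem
w_mask_trained = [[0.3, 0.0], [0.0, -0.2]]      # hypothetical weights after training

image_patch = [1.0, 2.0, 3.0]
mask_patch = [1.0, 0.0]

image_only = matvec(w_image, image_patch)
at_init = fuse_patch(image_patch, mask_patch, w_image, w_mask_init)
after_training = fuse_patch(image_patch, mask_patch, w_image, w_mask_trained)
```

With zero mask-stem weights the fused embedding equals the image-only embedding, so the mask pathway can only start to influence features once training moves those weights away from zero.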
A two-pass masked re-encoding strategy appears in longitudinal medical VQA. The current image is first co-registered to the reference image by a shallow CNN that predicts a near-identity affine transform. A shared encoder then extracts token sequences from the registered current image and the reference image. Masks are generated from a frozen DINO prior and a trainable adaptive mask generator and are combined into a single final shared mask. The same mask is applied to both visits, after which the masked images are re-encoded and passed to a GPT-2-based multimodal decoder. Here the MVE is not simply a mask-augmented feature extractor; it is a longitudinal comparison block in which the same saliency hypothesis constrains both time points (Wu et al., 3 Jun 2026).
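The step of applying one mask to both visits can be sketched as follows. This is only an illustration under simplifying assumptions: the shared mask is taken as given (how the DINO prior and the adaptive generator are combined is not modeled), images are small nested lists, and all function names are hypothetical.

```python
def apply_mask(image, mask):
    """Element-wise product of an image and a same-sized mask (nested lists)."""
    return [[pixel * weight for pixel, weight in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def mask_both_visits(current, reference, shared_mask):
    """Apply one shared mask to the registered current image and the reference image."""
    return apply_mask(current, shared_mask), apply_mask(reference, shared_mask)


# Hypothetical 2x3 toy images; the shared mask keeps only the middle column.
current = [[0.9, 0.4, 0.1], [0.8, 0.7, 0.2]]
reference = [[0.9, 0.1, 0.1], [0.8, 0.2, 0.2]]
shared_mask = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

masked_current, masked_reference = mask_both_visits(current, reference, shared_mask)
```

Because both images pass through the same mask, any difference that survives masking lies inside the selected region, which is what makes the second encoding pass a comparison of like with like.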
Hear-Your-Click uses a dual-branch convolutional encoder for object-aware video representation. The video branch receives the element-wise masked video, while a second branch encodes the mask sequence itself. Both branches use SlowOnly backbones, and the final object-level visual embedding is formed from the outputs of the two branches. The mask therefore acts twice: once by suppressing irrelevant pixels and once by providing an explicit temporal structure stream (Liang et al., 7 Jul 2025).
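A toy sketch of the two roles of the mask is given below. It assumes a stand-in encoder (a per-frame spatial mean instead of a SlowOnly backbone) and assumes that the two branch outputs are paired per frame; both choices, and every name in the code, are hypothetical rather than taken from Hear-Your-Click.

```python
def mask_video(video, masks):
    """Element-wise product of each frame with its object mask."""
    return [[[p * m for p, m in zip(frow, mrow)] for frow, mrow in zip(frame, mask)]
            for frame, mask in zip(video, masks)]


def toy_encoder(frames):
    """Stand-in for a video backbone: one scalar per frame (the spatial mean)."""
    return [sum(sum(row) for row in frame) / sum(len(row) for row in frame)
            for frame in frames]


def object_embedding(video, masks):
    """Pair per-frame features of the masked-video branch and the mask branch.

    Pairing (concatenating) the two outputs is an assumption of this sketch.
    """
    video_features = toy_encoder(mask_video(video, masks))
    mask_features = toy_encoder(masks)
    return [[v, m] for v, m in zip(video_features, mask_features)]


# Two 2x2 frames; the object occupies the left column in both frames.
masks = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
video = [[[0.2, 0.9], [0.4, 0.9]], [[0.6, 0.1], [0.8, 0.1]]]
# Same object pixels, different background pixels.
video_new_background = [[[0.2, 0.0], [0.4, 0.5]], [[0.6, 0.7], [0.8, 0.3]]]

embedding = object_embedding(video, masks)
embedding_new_background = object_embedding(video_new_background, masks)
```

Changing only background pixels leaves the embedding unchanged, which is the pixel-suppression role; the second entry of each pair depends only on the mask sequence, which is the structural role.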
SegTalker uses a different formulation altogether. The mask-guided encoder is a multi-scale encoder, described as a variation of the pSp-style encoder with a Feature Pyramid Network, that extracts per-region style codes rather than a single global representation. For each feature scale $i$ and region $j$, a masked global average pooled feature is computed. In its standard form, with $F_i$ the feature map at scale $i$ and $M_j$ the mask of region $j$ resized to that scale (notation assumed here), this reads

$$f_{i,j} = \frac{\sum_{x,y} M_j(x,y)\, F_i(x,y)}{\sum_{x,y} M_j(x,y)}.$$

The pooled features across scales are then concatenated and mapped to a region style code. The resulting latent representation matches the layer-wise modulation structure of the downstream StyleGAN generator (Xiong et al., 2024).
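Masked global average pooling and the cross-scale concatenation can be sketched in a few lines. The sketch assumes that a mask is already available at each feature scale and stops before the learned mapping to a style code; the function names and toy tensors are hypothetical.

```python
def masked_average_pool(feature_map, mask):
    """Average the channel vectors of feature_map over pixels, weighted by mask.

    feature_map: H x W x C nested lists; mask: H x W nested lists in [0, 1].
    """
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    mask_area = 0.0
    for feature_row, mask_row in zip(feature_map, mask):
        for feature_vector, weight in zip(feature_row, mask_row):
            mask_area += weight
            for c in range(channels):
                totals[c] += weight * feature_vector[c]
    if mask_area == 0.0:
        return [0.0] * channels  # empty region: fall back to a zero vector
    return [total / mask_area for total in totals]


def region_descriptor(feature_maps, masks_per_scale):
    """Concatenate the masked-pooled features of one region across scales."""
    descriptor = []
    for feature_map, mask in zip(feature_maps, masks_per_scale):
        descriptor.extend(masked_average_pool(feature_map, mask))
    return descriptor


# Toy example: a 2x2 map with 2 channels and a 1x1 map with 1 channel.
fine_features = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0], [7.0, 70.0]]]
coarse_features = [[[4.0]]]
fine_mask = [[1.0, 1.0], [0.0, 0.0]]  # region covers the top row only
coarse_mask = [[1.0]]

descriptor = region_descriptor([fine_features, coarse_features],
                               [fine_mask, coarse_mask])
```

Pixels outside the region contribute nothing to the descriptor, so each region's code is computed only from that region's own features.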
MrRegNet is an important counterexample to any narrow architectural definition. The paper explicitly notes that it does not specify a separate detailed MVE subnetwork. Instead, it uses a multi-resolution encoder-decoder CNN with multiple resolution levels, concatenated source-target input, residual blocks in the encoder, transposed-convolution upsampling in the decoder, and skip concatenation across matching scales. The “mask-guided” aspect is implemented through the supervision pathway: segmentation masks are warped by the current displacement field and compared to target masks at each resolution (Li et al., 2024).
3. Objective functions and supervision regimes
SonoCLIP combines standard global image-text alignment with a region-text objective. For global alignment, it uses the usual symmetric InfoNCE/CLIP loss over normalized image and text embeddings. For region-text alignment, it replaces the batch-coupled softmax with a pairwise sigmoid loss inspired by SigLIP, defined together with a symmetric text-to-image term. The final loss combines the global and region-text objectives.
The paper states that this formulation is more stable under large-scale and heterogeneous supervision, and uses it to support joint global-local contrastive representation learning with segmentation masks as mask-channel visual prompts (Su et al., 28 Jun 2026).
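For orientation, the generic SigLIP-style pairwise sigmoid loss can be sketched as below. This is the standard form of that loss, not necessarily the exact region-text objective used in SonoCLIP; the temperature and bias defaults follow the usual SigLIP initialization, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_sigmoid_loss(similarity, temperature=10.0, bias=-10.0):
    """SigLIP-style loss over an N x N region-text similarity matrix.

    Entry (i, j) is treated as a positive pair when i == j, negative otherwise.
    Every pair is scored independently, with no softmax over the batch.
    """
    n = len(similarity)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * similarity[i][j] + bias))
    return -total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # mismatched pairs score highest

loss_aligned = pairwise_sigmoid_loss(aligned)
loss_misaligned = pairwise_sigmoid_loss(misaligned)
```

Each pair contributes its own binary term, so the loss for one region-text pair does not depend on which other samples happen to share the batch.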
The longitudinal medical VQA model supplements language-modeling and registration terms with three auxiliary objectives. The registration regularizer encourages near-identity transforms by penalizing deviation of the predicted affine transform from the identity. Mask rebuilding constrains masked features to match the original features gated by the final mask. Pairwise Gram-style consistency regularizes patch-to-patch relational structure across visits for both full and masked features. KoLeo uniformity discourages feature collapse through a nearest-neighbor dispersion penalty. The paper also specifies weights for the auxiliary losses (Wu et al., 3 Jun 2026).
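Two of these terms are easy to illustrate in isolation. The sketch below assumes a squared-distance form for the near-identity regularizer and uses the standard KoLeo estimator (negative mean log nearest-neighbor distance); the exact forms and weights used in the paper may differ, and the function names are hypothetical.

```python
import math


def identity_penalty(affine):
    """Squared distance between a 2x3 affine matrix and the identity transform."""
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    return sum((a - i) ** 2
               for row, id_row in zip(affine, identity)
               for a, i in zip(row, id_row))


def koleo_loss(features, eps=1e-8):
    """Negative mean log distance from each feature to its nearest neighbor."""
    total = 0.0
    for i, x in enumerate(features):
        nearest = min(math.dist(x, y) for j, y in enumerate(features) if j != i)
        total += math.log(nearest + eps)
    return -total / len(features)


shifted = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]        # identity plus a translation
clustered = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]]  # nearly collapsed features
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]       # well-separated features

shift_penalty = identity_penalty(shifted)
koleo_clustered = koleo_loss(clustered)
koleo_spread = koleo_loss(spread)
```

The KoLeo value is larger for the nearly collapsed features than for the well-separated ones, so minimizing it pushes features apart.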
Hear-Your-Click trains MVE under an audiovisual contrastive objective. After temporal averaging, the model applies a cosine-similarity contrastive loss in the form of a batch-wise symmetric InfoNCE-like objective over the visual and audio embeddings. The visual encoder and audio encoder are initialized from Diff-Foley pretrained weights, while the mask-branch encoder is initialized randomly. One of the pretrained encoders is mostly frozen, with only its final MLP block left trainable, and the remaining two encoders are fully trainable (Liang et al., 7 Jul 2025).
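A generic symmetric InfoNCE-style loss over temporally averaged embeddings can be sketched as follows. The temperature value and all function names are assumptions of this sketch, and it is not claimed to reproduce the exact loss in Hear-Your-Click.

```python
import math


def temporal_mean(frame_features):
    """Average a list of per-frame feature vectors into one clip-level vector."""
    n = len(frame_features)
    return [sum(values) / n for values in zip(*frame_features)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(value - peak) for value in row))
        loss += log_norm - row[i]
    return loss / len(logits)


def symmetric_contrastive_loss(visual, audio, temperature=0.07):
    """Average of visual-to-audio and audio-to-visual cross-entropy terms."""
    logits = [[cosine(v, a) / temperature for a in audio] for v in visual]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Two clips, each with two frames of 2-dim visual features (toy values).
visual = [temporal_mean([[1.0, 0.0], [0.8, 0.2]]),
          temporal_mean([[0.0, 1.0], [0.2, 0.8]])]
audio_matched = [[0.9, 0.1], [0.1, 0.9]]
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(visual, audio_matched)
loss_swapped = symmetric_contrastive_loss(visual, audio_swapped)
```

The loss is low when each clip's visual embedding is closest to its own audio embedding and high when the pairing is swapped.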
MrRegNet uses a multi-resolution registration objective that combines image similarity, mask similarity, and smoothness terms.
The source mask is warped by the cumulative displacement field at each scale and compared to the downsampled target mask with a soft Dice loss. The paper states that masks are used only during training, not inference, and that smoothness weights are dynamic across scales, with larger values at coarse levels and smaller values at fine levels (Li et al., 2024).
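The mask term and the scale-dependent smoothness weighting can be illustrated with a small sketch. It assumes that the per-level terms are simply summed, takes already-warped masks as flat lists, and uses hypothetical dictionary keys and weight values; the actual weighting scheme in MrRegNet is not reproduced here.

```python
def soft_dice_loss(warped_mask, target_mask, eps=1e-6):
    """1 - Dice overlap between two soft masks (flat lists of values in [0, 1])."""
    intersection = sum(w * t for w, t in zip(warped_mask, target_mask))
    denominator = sum(warped_mask) + sum(target_mask)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def multires_objective(levels, mask_weight=1.0):
    """Sum image, mask, and smoothness terms over resolution levels.

    Each level is a dict with the hypothetical keys 'image_loss', 'warped_mask',
    'target_mask', 'smoothness', and 'smooth_weight'.
    """
    total = 0.0
    for level in levels:
        mask_loss = soft_dice_loss(level["warped_mask"], level["target_mask"])
        total += (level["image_loss"]
                  + mask_weight * mask_loss
                  + level["smooth_weight"] * level["smoothness"])
    return total


levels = [
    # Coarse level: masks already overlap; larger smoothness weight.
    {"image_loss": 0.2, "warped_mask": [1.0, 0.0], "target_mask": [1.0, 0.0],
     "smoothness": 0.5, "smooth_weight": 1.0},
    # Fine level: masks are disjoint; smaller smoothness weight.
    {"image_loss": 0.1, "warped_mask": [1.0, 1.0, 0.0, 0.0],
     "target_mask": [0.0, 0.0, 1.0, 1.0],
     "smoothness": 0.5, "smooth_weight": 0.1},
]

total_loss = multires_objective(levels)
```

Since the masks appear only inside the loss, nothing in the forward pass of the registration network needs them, which is consistent with masks being used only during training.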
SegTalker situates the mask-guided encoder inside a broader generative training pipeline. For talking segmentation generation, the paper gives a weighted cross-entropy reconstruction loss and a segmentation-domain SyncNet loss. For the segmentation-guided GAN injection stage, it states that training uses pixel-wise loss, LPIPS loss, identity loss, face parsing loss, and adversarial loss. The text explicitly notes that the exact scalar coefficients for these SGI losses are not provided in the main text excerpt, even though the final objective is described as a weighted combination (Xiong et al., 2024).
4. Inference regimes and controllability
MVE architectures differ sharply in whether masks are externally supplied, internally predicted, or used only during optimization. SonoCLIP supports both standard global inference and mask-guided inference. At test time it can take a user-provided mask or a predicted mask together with the ultrasound image, produce a local region representation, and compare that region embedding against text embeddings of candidate descriptions. The paper emphasizes that global inference matches the whole image to plane descriptions, whereas mask-guided inference focuses on the anatomy inside the mask (Su et al., 28 Jun 2026).
In the longitudinal VQA setting, the mask is neither purely external nor merely supervisory. A frozen DINO-based mask generator and a trainable adaptive mask generator produce a shared mask that is applied to both the current and reference images before re-encoding. The decoder then receives a multimodal prefix formed by sequential concatenation of masked main-image tokens, masked reference-image tokens, and question tokens.
This produces intrinsic interpretability through the shared saliency mask while tightly coupling region selection and answer generation (Wu et al., 3 Jun 2026).
Hear-Your-Click is explicitly interactive. The framework enables users to generate sounds for specific objects in videos by simply clicking on the frame. In the broader generation pipeline, the MVE output is combined with per-frame CLIP embeddings, and this combined condition guides the latent diffusion model for audio generation. The object mask therefore functions as a user interface, a visual selector, and a conditioning variable for generation (Liang et al., 7 Jul 2025).
SegTalker uses masks for both semantic control and local editing. The generated talking segmentation controls spatial structure, while per-region style codes extracted from the source portrait preserve textures such as skin, lips, teeth, hair, eyebrows, and background. By editing the mask and swapping the region textures from a given reference image, the framework enables local editing including hair editing, eyebrow editing, lip makeup, blinking, and background replacement (Xiong et al., 2024).
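The texture-swapping side of local editing can be pictured as replacing individual entries in a per-region table of style codes. The sketch below is only a schematic of that idea, assuming style codes are stored in a dictionary keyed by region name; the region names, code values, and function name are hypothetical.

```python
def swap_region_styles(source_codes, reference_codes, regions):
    """Return a copy of source_codes with the listed regions taken from reference_codes."""
    edited = dict(source_codes)
    for region in regions:
        edited[region] = reference_codes[region]
    return edited


source_codes = {"skin": [0.1, 0.2], "hair": [0.3, 0.4], "lips": [0.5, 0.6]}
reference_codes = {"skin": [0.9, 0.8], "hair": [0.7, 0.6], "lips": [0.5, 0.4]}

# Hair editing: take only the hair style code from the reference portrait.
edited_codes = swap_region_styles(source_codes, reference_codes, ["hair"])
```

Only the selected region changes; all other regions keep the source portrait's codes, which is what keeps the edit local.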
MrRegNet occupies the opposite end of the spectrum. Masks guide the network during training by shaping the objective at every resolution level, but they are not required during inference. This distinction is central: in MrRegNet, mask guidance improves the learned registration function without turning mask input into a runtime control variable (Li et al., 2024).
5. Empirical evidence across application domains
The strongest quantitative evidence for explicit runtime mask guidance appears in SonoCLIP. On zero-shot classification on FetalP24 (Center B), the paper reports Top-1/Top-5 scores of 10.52/29.59 for CLIP, 16.69/50.34 for UniMed-CLIP, 39.78/83.25 for FetalCLIP, 58.38/94.47 for SonoCLIP (w/o mask), and 85.01/99.01 for SonoCLIP (w/ mask). The paper further states that, relative to FetalCLIP, the domain-adapted SonoCLIP improves by +18.60 Top-1 and +11.22 Top-5, and that mask-guided inference yields an additional +26.63 Top-1 and +4.54 Top-5 over SonoCLIP without masks. On FetalP6 linear probing, average accuracy and average F1 are reported as 87.1/85.8 for CLIP, 83.5/82.7 for UniMed-CLIP, 94.6/92.3 for FetalCLIP, 96.3/95.3 for SonoCLIP (w/o mask), and 99.3/98.8 for SonoCLIP (w/ mask). On FetalP5 segmentation, average Dice and average IoU for SonoCLIP (w/o mask) are reported as 87.2 and 80.5, exceeding CLIP, UniMed-CLIP, and FetalCLIP in the listed comparison (Su et al., 28 Jun 2026).
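The stated gains follow directly from the zero-shot scores, which a short check makes explicit. The numbers are the ones quoted above; the dictionary layout and the helper name are only for illustration.

```python
top1 = {"FetalCLIP": 39.78, "SonoCLIP (w/o mask)": 58.38, "SonoCLIP (w/ mask)": 85.01}
top5 = {"FetalCLIP": 83.25, "SonoCLIP (w/o mask)": 94.47, "SonoCLIP (w/ mask)": 99.01}


def gain(scores, baseline, model):
    """Absolute improvement of model over baseline, rounded to two decimals."""
    return round(scores[model] - scores[baseline], 2)


# Gain from domain adaptation alone (no mask at inference).
domain_gain = (gain(top1, "FetalCLIP", "SonoCLIP (w/o mask)"),
               gain(top5, "FetalCLIP", "SonoCLIP (w/o mask)"))

# Additional gain from supplying a mask at inference.
mask_gain = (gain(top1, "SonoCLIP (w/o mask)", "SonoCLIP (w/ mask)"),
             gain(top5, "SonoCLIP (w/o mask)", "SonoCLIP (w/ mask)"))
```

On Top-1, the additional gain from mask-guided inference is larger than the gain from domain adaptation alone, which is why this result is the clearest evidence for runtime mask guidance.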
The longitudinal medical VQA paper gives a direct ablation on attention masks. On Medical-Diff-VQA, the full method achieves BLEU-1 0.747, BLEU-2 0.620, BLEU-3 0.510, BLEU-4 0.425, METEOR 0.700, ROUGE-L 0.703, and CIDEr 2.011. Removing mask guidance reduces performance to BLEU-1 0.706, BLEU-2 0.584, BLEU-3 0.480, BLEU-4 0.399, METEOR 0.699, ROUGE-L 0.680, and CIDEr 1.844. The same study reports that removing frozen image-encoder warm-up reduces CIDEr from 2.011 to 1.714 and that removing the DINO-inspired unsupervised objectives reduces CIDEr to 1.765 (Wu et al., 3 Jun 2026).
SegTalker reports quantitative results on HDTF for the full pipeline: Sync 6.872, F-LMD 3.405, M-LMD 3.173, FID 10.348, LPIPS 0.0494, PSNR 33.590, SSIM 0.934, and FVD 9.205, with the table noting that FVD is shown as a scaled value. Ablations further report, for example, that removing prior learning yields Sync 5.314, FID 28.254, PSNR 28.685, and SSIM 0.847; removing cross entropy yields Sync 3.126, FID 63.264, PSNR 14.257, and SSIM 0.479; and removing SyncNet yields Sync 4.174, FID 10.647, PSNR 32.036, and SSIM 0.913 (Xiong et al., 2024).
Hear-Your-Click introduces a new evaluation metric, the CAV score, and states that extensive experiments demonstrate more precise control and improved generation performance across various metrics. The detailed numerical results are not reproduced here, but the paper-level claim is that object-aware masking improves audiovisual correspondence and interactive controllability in complex scenes (Liang et al., 7 Jul 2025).
MrRegNet reports that the proposed method outperforms traditional methods such as Demons and the deep learning method VoxelMorph on the OASIS public 3D brain MRI dataset and a local 2D brain MRI dataset with large deformations, with image alignment accuracy significantly improved in the local regions guided by segmentation masks. The numerical table is not reproduced here, but the reported outcome is specifically tied to local region alignment under mask-guided supervision (Li et al., 2024).
6. Conceptual implications, misconceptions, and limits
A common misconception is that MVE always means concatenating a mask to the image at the encoder input. SonoCLIP does use an input-stage mask-channel visual pathway, but the other papers contradict any single architectural reading. The longitudinal VQA model computes a shared mask and re-encodes masked image pairs; Hear-Your-Click uses dual-branch object-aware encoding with masked video and mask streams; SegTalker performs region-wise masked pooling to produce style codes; and MrRegNet does not define a separate detailed MVE subnetwork at all, relying instead on mask-guided loss terms during training (Su et al., 28 Jun 2026, Wu et al., 3 Jun 2026, Liang et al., 7 Jul 2025, Xiong et al., 2024, Li et al., 2024).
A second misconception is that mask guidance necessarily implies inference-time controllability. This is true for SonoCLIP, where a user-provided or predicted mask can guide zero-shot inference, and for Hear-Your-Click, where a user click specifies the object whose sound is to be generated. It is also true for SegTalker, where masks and region style codes support local editing. It is not true for MrRegNet, where ROI masks are only used during training and are not required during inference (Su et al., 28 Jun 2026, Liang et al., 7 Jul 2025, Xiong et al., 2024, Li et al., 2024).
The available papers also point to characteristic limits. The longitudinal VQA work notes that some irrelevant areas may still leak into the mask. SegTalker’s main-text excerpt does not provide exact scalar coefficients for all SGI loss terms. MrRegNet does not specify a separate detailed MVE architecture in the same way other papers do. These are not contradictions; they indicate that “mask-guided visual encoder” is a flexible label whose exact meaning depends on whether the mask is an input prompt, a learned latent selector, a supervision signal, or a semantic partition (Wu et al., 3 Jun 2026, Xiong et al., 2024, Li et al., 2024).
A plausible implication is that the unifying function of MVE is not a specific block diagram but a specific inductive bias: local relevance is enforced early enough, or strongly enough, that downstream learning no longer depends solely on global visual summaries. The empirical patterns reported in fetal ultrasound, longitudinal VQA, talking-face generation, interactive video-to-audio generation, and image registration are consistent with that interpretation, although each paper operationalizes the bias through a different mechanism (Su et al., 28 Jun 2026, Wu et al., 3 Jun 2026, Liang et al., 7 Jul 2025, Xiong et al., 2024, Li et al., 2024).