OCAV: Object-aware Contrastive Audio-Visual Fine-Tuning

Updated 4 July 2026

The paper introduces masked, object-aware contrastive fine-tuning to improve audio-visual correspondence by aligning object masks with corresponding audio signals.
It utilizes a Mask-guided Visual Encoder and targeted augmentations like Random Video Stitching and Mask-guided Loudness Modulation to extract precise object-level features.
Empirical results indicate that OCAV enhances fidelity and object-specific control, enabling more accurate audio generation in complex visual scenes.

Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) denotes a contrastive fine-tuning strategy that aligns object-level visual evidence with corresponding audio, rather than aligning whole videos with whole audio clips. In the supplied literature, the term is instantiated most directly by "Hear-Your-Click" (Liang et al., 7 Jul 2025), where OCAV serves as the object-aware alignment stage for interactive video-to-audio generation: a user selects an object by clicking on a frame, a mask sequence specifies the target object through time, and a Mask-guided Visual Encoder learns object-centric visual features that are contrastively aligned with the paired audio. The motivating claim is that global video representations are insufficient in complex scenes, because they blur foreground and background and often fail to generate audio tailored to specific objects or regions. More broadly, several adjacent works pursue finer-grained audio-visual correspondence through object labels, spatial crops, sounding regions, or detected objects, but they differ in whether object awareness is explicit, whether contrastive learning itself is redesigned, and whether the downstream task is generation, retrieval, localization, classification, or question answering (Nakada et al., 2024).

1. Conceptual basis and historical placement

OCAV emerged against a background in which audio-visual learning was typically organized around clip-level correspondence. Earlier self-supervised objectives such as audio-visual correspondence and audio-visual temporal synchronization learned from whether paired clips came from the same video or the same moment, but they did not force the model to identify where sound originated or which object instance produced it. "Learning Representations from Audio-Visual Spatial Alignment" reformulated the supervision signal around viewpoint-specific matching in 360° video and spatial audio, thereby moving beyond video-level matching toward spatially specific alignment; however, its unit of supervision remained the crop or viewpoint rather than a detected object or explicit object proposal (Morgado et al., 2020).

A related limitation was identified in "DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information" (Nakada et al., 2024). That work argues that contrastive pretraining, even when combined with masked autoencoding as in CAV-MAE, often learns only coarse semantic alignment: it can associate broad categories such as animals or instruments with their sounds, yet still fail at fine-grained discrimination such as dog versus lion or oboe versus flute. The paper’s failure examples, such as a dog image retrieving lion or bird sounds or an oboe sound retrieving flute or horn visuals, illustrate the central OCAV motivation: object awareness is introduced to make cross-modal representations more semantically precise.

Within this trajectory, Hear-Your-Click frames OCAV around a different downstream requirement: not merely recognizing that a clip contains a plausible sound source, but generating sound for a user-specified object or region. Its specific claim is that current video-to-audio methods rely on global video information and therefore struggle with complex scenes, multiple objects, and object-specific control (Liang et al., 7 Jul 2025). This places OCAV at the intersection of fine-grained alignment and controllable generation.

2. Formal definition and core architecture

In Hear-Your-Click, the input video is

$\mathcal{V}\in\mathbb{R}^{T\times H\times W\times 3},$

the target object is represented by a binary mask sequence

$\mathcal{M}=\{\mathcal{M}_1,\ldots,\mathcal{M}_T\},\quad \mathcal{M}_t\in\{0,1\}^{H\times W},$

and the goal is to generate audio

$\mathcal{A}\in\mathbb{R}^{T'\times N},$

where $T'$ is the audio time length and $N$ is the number of mel bins (Liang et al., 7 Jul 2025). The mask is the structural element that makes the method object-aware: during training it is produced from human-labeled text prompts and automatic segmentation, and at inference it is obtained from user clicks followed by mask propagation.

The central architectural component is the Mask-guided Visual Encoder (MVE). It takes both the masked video $\mathcal{V}\odot\mathcal{M}$ and the mask $\mathcal{M}$ itself. The masked-video branch produces

$\mathbf{x}_{mv}=\mathit{norm}(f_v(\mathcal{V}\odot\mathcal{M})),$

the mask branch produces

$\mathbf{x}_m=\mathit{norm}(f_m(\mathcal{M})),$

and the final object-aware visual feature is

$\mathbf{x}_{v}=\mathit{norm}(\mathbf{x}_{mv}+\mathbf{x}_m).$

The feature sequence is then averaged over time to obtain a single clip-level visual embedding, written $\bar{\mathbf{x}}_v$ here (Liang et al., 7 Jul 2025). The paper’s stated rationale is that masked videos are more stable and clearer than original videos because they suppress background interference, while the explicit mask stream captures where and when the target object is present.
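The fusion and temporal averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the random arrays stand in for the outputs of the two encoder branches, the function names and the sizes `T` and `D` are hypothetical, and `norm` is assumed to mean L2 normalization along the feature axis.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse_mve_features(masked_video_feats, mask_feats):
    """Fuse per-frame branch outputs of shape (T, D) and average over time.

    Both inputs are hypothetical stand-ins for f_v(V * M) and f_m(M).
    """
    x_mv = l2_normalize(masked_video_feats)  # x_mv = norm(f_v(V * M))
    x_m = l2_normalize(mask_feats)           # x_m  = norm(f_m(M))
    x_v = l2_normalize(x_mv + x_m)           # x_v  = norm(x_mv + x_m)
    return x_v, x_v.mean(axis=0)             # per-frame features, clip-level embedding

rng = np.random.default_rng(0)
T, D = 16, 512  # assumed sizes, for illustration only
per_frame, clip_embedding = fuse_mve_features(
    rng.normal(size=(T, D)), rng.normal(size=(T, D))
)
print(per_frame.shape, clip_embedding.shape)
```

Because each branch is normalized before the sum, neither the masked-video stream nor the mask stream can dominate the fused feature purely through its magnitude.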

The paired audio is encoded by a convolutional audio encoder:

$\mathbf{x}_a=\mathit{norm}(f_a(\mathcal{A})),$

followed by temporal averaging to obtain a clip-level audio embedding, written $\bar{\mathbf{x}}_a$ here (Liang et al., 7 Jul 2025). Positive pairs are synchronized object-mask/audio pairs from the same training triplet, and negatives are the other samples in the minibatch.

The fine-tuning regime is staged rather than fully end-to-end. The video and audio backbones are initialized from Diff-Foley pretrained weights. The visual encoder $f_v$ is mostly frozen except for the final MLP block, the mask encoder $f_m$ is trained from scratch, and the audio encoder $f_a$ is fully trainable (Liang et al., 7 Jul 2025). After OCAV, a latent diffusion model is trained separately for audio generation, conditioned on visual features.

3. Contrastive objective and object-aware augmentations

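The OCAV loss in Hear-Your-Click is a symmetric InfoNCE-style objective. Written in the standard form for a minibatch of $B$ triplets, with $\bar{\mathbf{x}}_v^{(i)}$ and $\bar{\mathbf{x}}_a^{(i)}$ denoting the temporally averaged visual and audio embeddings of sample $i$ and $\tau$ a temperature (symbols chosen here for exposition), it reads

$\mathcal{L}_{\mathrm{OCAV}}=-\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp\left(\mathrm{sim}(\bar{\mathbf{x}}_v^{(i)},\bar{\mathbf{x}}_a^{(i)})/\tau\right)}{\sum_{j=1}^{B}\exp\left(\mathrm{sim}(\bar{\mathbf{x}}_v^{(i)},\bar{\mathbf{x}}_a^{(j)})/\tau\right)}+\log\frac{\exp\left(\mathrm{sim}(\bar{\mathbf{x}}_v^{(i)},\bar{\mathbf{x}}_a^{(i)})/\tau\right)}{\sum_{j=1}^{B}\exp\left(\mathrm{sim}(\bar{\mathbf{x}}_v^{(j)},\bar{\mathbf{x}}_a^{(i)})/\tau\right)}\right],$

with cosine similarity

$\mathrm{sim}(\mathbf{u},\mathbf{v})=\frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}.$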

Unlike prior contrastive audio-visual learning centered on global clip embeddings, the positive visual signal is explicitly mask-grounded and object-specific (Liang et al., 7 Jul 2025).
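To make the objective concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss with cosine similarity, where the other samples in the minibatch act as negatives. It is the generic textbook form rather than the paper's code; the function names and the temperature value 0.07 are assumptions, and the embeddings are random placeholders for clip-level visual and audio features.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(visual, audio, temperature=0.07):
    """Symmetric InfoNCE over a batch of (B, D) visual and audio embeddings.

    Row i of `visual` and row i of `audio` form the positive pair;
    every other row in the batch is treated as a negative.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # (B, B) cosine similarities scaled by 1 / tau
    visual_to_audio = -np.mean(np.diag(log_softmax(logits, axis=1)))
    audio_to_visual = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (visual_to_audio + audio_to_visual)

rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 32))
aligned_audio = visual + 0.01 * rng.normal(size=(8, 32))  # near-perfect pairing
shuffled_audio = np.roll(aligned_audio, shift=1, axis=0)  # every pair mismatched
loss_aligned = symmetric_info_nce(visual, aligned_audio)
loss_shuffled = symmetric_info_nce(visual, shuffled_audio)
print(loss_aligned < loss_shuffled)
```

In the object-aware setting, only the inputs change: the visual rows would be mask-grounded embeddings rather than global clip embeddings, while the loss itself keeps this form.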

Two augmentations are designed to strengthen this signal. Random Video Stitching (RVS) composes more complex scenes by selecting another video and stitching the two videos horizontally or vertically frame-by-frame while overlapping their audio tracks. The stated purpose is to improve robustness to multi-object scenes and to force disentanglement of object-specific signals. The paper also notes a caveat: RVS slightly hurts some quantitative metrics, likely because resizing during stitching distorts aspect ratios (Liang et al., 7 Jul 2025).
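A rough sketch of the stitching step is given below, assuming both clips have the same number of frames and equal-length audio. Everything here is illustrative: the function names are hypothetical, the nearest-neighbour resize is a stand-in for a real image resizer, and squeezing each clip into half of the original frame is an assumption about how the stitched output keeps its size.

```python
import numpy as np

def resize_nearest(video, out_h, out_w):
    """Nearest-neighbour resize of a (T, H, W, C) video (stand-in for a real resizer)."""
    _, h, w, _ = video.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return video[:, rows][:, :, cols]

def random_video_stitch(video_a, audio_a, video_b, audio_b, rng):
    """Stitch two clips side by side or top to bottom and overlap their audio."""
    _, h, w, _ = video_a.shape
    horizontal = bool(rng.integers(0, 2))  # pick the stitching direction at random
    if horizontal:
        left = resize_nearest(video_a, h, w // 2)
        right = resize_nearest(video_b, h, w - w // 2)
        stitched = np.concatenate([left, right], axis=2)
    else:
        top = resize_nearest(video_a, h // 2, w)
        bottom = resize_nearest(video_b, h - h // 2, w)
        stitched = np.concatenate([top, bottom], axis=1)
    mixed_audio = audio_a + audio_b  # sample-wise overlap of the two tracks
    return stitched, mixed_audio, horizontal

rng = np.random.default_rng(0)
video_a, video_b = rng.random((4, 8, 10, 3)), rng.random((4, 6, 12, 3))
audio_a, audio_b = rng.normal(size=100), rng.normal(size=100)
stitched, mixed_audio, horizontal = random_video_stitch(video_a, audio_a, video_b, audio_b, rng)
print(stitched.shape, mixed_audio.shape, horizontal)
```

Each source clip is resized to half the frame without preserving its aspect ratio, which is the kind of geometric distortion the paper offers as a likely cause of the metric drop. The target object's mask would presumably need the same resize and placement, with zeros over the other clip's half; that step is omitted here.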

Mask-guided Loudness Modulation (MLM) ties audio amplitude to the target object’s mask size over time. For frame $t$, the occupancy ratio is the fraction of the frame covered by the mask,

$r_t=\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\mathcal{M}_t(h,w),$

which is then normalized across the frames of the clip; the exact normalization formula is not recoverable from the supplied text.

These values are linearly interpolated to match the audio waveform length and then used to modulate loudness elementwise (Liang et al., 7 Jul 2025). The intended supervisory effect is that when the selected object becomes smaller or leaves the frame, its sound becomes quieter or stops. This suggests that OCAV is not only object-selective in space, but also sensitive to temporal object presence.
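The modulation can be sketched as follows. This is an illustrative version only: the function name is hypothetical, and dividing the occupancy ratios by their peak value is an assumed normalization, since the paper's exact formula is not available here. The interpolation uses `numpy.interp`, which performs one-dimensional linear interpolation.

```python
import numpy as np

def mask_guided_loudness_modulation(waveform, masks, eps=1e-8):
    """Scale a waveform by the interpolated per-frame mask occupancy.

    waveform: (num_samples,) audio signal.
    masks:    (T, H, W) binary mask sequence for the target object.
    """
    # Fraction of each frame covered by the mask.
    occupancy = masks.reshape(masks.shape[0], -1).mean(axis=1)
    # ASSUMPTION: normalize by the peak occupancy so the largest mask keeps full loudness.
    gain = occupancy / (occupancy.max() + eps)
    # Linearly interpolate the T per-frame gains up to the waveform length.
    frame_positions = np.linspace(0.0, 1.0, num=len(gain))
    sample_positions = np.linspace(0.0, 1.0, num=len(waveform))
    envelope = np.interp(sample_positions, frame_positions, gain)
    return waveform * envelope, envelope

# Toy clip: the object fills the frame, then shrinks, then disappears.
masks = np.zeros((4, 4, 4), dtype=int)
masks[0] = 1
masks[1, :2] = 1
masks[2, :1] = 1
waveform = np.ones(7)
modulated, envelope = mask_guided_loudness_modulation(waveform, masks)
print(np.round(envelope, 3))
```

In the toy clip the envelope falls from full loudness to silence as the mask shrinks to nothing, matching the intended behaviour that a vanishing object stops producing sound.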

The OCAV formulation differs materially from several adjacent methods. DETECLAP adds binary cross-entropy losses for pseudo object labels on top of a base contrastive-plus-reconstruction objective, rather than redefining contrastive positives around object masks (Nakada et al., 2024). "Object-aware Sound Source Localization via Audio-Visual Scene Understanding" introduces an Object-aware Contrastive Alignment loss and an Object Region Isolation loss, but its object semantics are derived from MLLM-generated foreground/background captions and similarity-pooled regions rather than from explicit click-conditioned mask sequences (Um et al., 23 Jun 2025). These differences are substantive: in Hear-Your-Click, objectness is provided directly by the mask input.

4. Data construction, training pipeline, and generative coupling

The training corpus for Hear-Your-Click is VGG-AnimSeg, a dataset built from VGGSound. The authors filter VGGSound to animal-related classes, use CLAP and CLIP to select samples whose audio, image, and text are strongly aligned, keep the top 400 training and 40 testing samples per textual description, and use DEVA to produce masks via text-prompted segmentation. The resulting dataset contains roughly 30,000 samples (Liang et al., 7 Jul 2025). This dataset construction is critical because OCAV requires synchronized triplets of video, object masks, and audio rather than only raw paired clips.
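The per-description selection step can be illustrated with a short sketch. It assumes each sample already carries a single scalar alignment score; how the CLAP and CLIP similarities are combined into such a score is not stated in the supplied text, so the `score` field and the function name are hypothetical.

```python
from collections import defaultdict

def select_top_k_per_description(samples, k):
    """Keep the k highest-scoring sample ids for each textual description.

    samples: iterable of (sample_id, description, score) tuples, where
    `score` is a hypothetical combined audio/image/text alignment score.
    """
    by_description = defaultdict(list)
    for sample_id, description, score in samples:
        by_description[description].append((score, sample_id))
    selected = {}
    for description, scored in by_description.items():
        scored.sort(reverse=True)  # highest alignment score first
        selected[description] = [sample_id for _, sample_id in scored[:k]]
    return selected

toy_samples = [
    ("a", "dog barking", 0.91),
    ("b", "dog barking", 0.42),
    ("c", "dog barking", 0.77),
    ("d", "cat meowing", 0.65),
]
selected = select_top_k_per_description(toy_samples, k=2)
print(selected)
```

With the reported settings, `k` would be 400 for the training split and 40 for the test split.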

The implementation details are concrete. The 10-second video clips are resampled to 4 fps, frames are resized to a fixed spatial resolution, mel spectrograms use 128 bins, and the OCAV training clip length is 4 seconds. The paper also specifies the audio and visual feature shapes used during OCAV training and the shape of the visual conditioning feature used in the later latent diffusion stage, but these values, like the frame resolution, are not recoverable from the supplied text (Liang et al., 7 Jul 2025). The video encoder $f_v$ is SlowOnly, the mask encoder $f_m$ is also SlowOnly-based, the audio encoder $f_a$ is PANNs, and the latent diffusion backbone uses the Stable Diffusion v1.4 latent encoder/decoder.

OCAV is not the final generative model. After fine-tuning, the latent diffusion model generates mel spectrograms conditioned on the object-aware MVE features together with a CLIP-based per-frame visual embedding computed from the masked frames (Liang et al., 7 Jul 2025). The paper therefore couples a contrastive representation-learning stage with a separate conditional generation stage. This architecture is narrower than general-purpose object-aware pretraining, but it is also more directly optimized for controllable video-to-audio generation.

The inference-time diffusion settings reported in the paper include classifier-free guidance scale 4.5, classifier guidance scale 50, DPM-Solver sampling, and 50 inference steps. For segmentation, SAM uses IoU threshold 0.88 and NMS threshold 0.8, while TAM uses 15 voting frames (Liang et al., 7 Jul 2025). These details matter because OCAV’s practical effectiveness depends on the mask pipeline remaining stable between training and interaction.

5. Interactive inference, evaluation, and empirical behavior

Hear-Your-Click operationalizes OCAV in an interactive loop. A user uploads a silent video, clicks an object in one frame, obtains a mask from SAM, optionally refines that mask interactively, propagates it through the video with TAM, extracts object-aware features with MVE, and then generates audio with the latent diffusion model conditioned on the object-aware visual representation (Liang et al., 7 Jul 2025). The paper emphasizes single-object selection as the primary interface, though the training augmentations are designed to improve robustness in multi-object scenes.

To evaluate object-aware correspondence, the paper introduces the CAV score. The procedure uses the C-MCR model, computes per-frame image embeddings, computes an audio embedding, averages the image embeddings over frames, and then measures similarity between the average image embedding and the audio embedding (Liang et al., 7 Jul 2025). No explicit formula for CAV is provided in the supplied text, so the metric is defined procedurally rather than analytically here.
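The procedure can be sketched as below. The embeddings are placeholders for what the C-MCR model would produce, the function name is hypothetical, and cosine similarity is an assumption: the supplied text says only "similarity", and reported CAV values such as 2.22 lie outside $[-1, 1]$, so the paper's score evidently uses a different scale or similarity function.

```python
import numpy as np

def cav_score(frame_embeddings, audio_embedding):
    """Similarity between the frame-averaged image embedding and an audio embedding.

    frame_embeddings: (T, D) per-frame image embeddings (placeholder inputs).
    audio_embedding:  (D,) audio embedding in the same space (placeholder input).
    ASSUMPTION: cosine similarity; the paper's exact similarity and scale are unknown.
    """
    mean_image = frame_embeddings.mean(axis=0)  # average over frames
    denom = np.linalg.norm(mean_image) * np.linalg.norm(audio_embedding)
    return float(mean_image @ audio_embedding / denom)

frames = np.array([[1.0, 0.0], [0.0, 1.0]])
matching = cav_score(frames, np.array([1.0, 1.0]))
orthogonal = cav_score(frames, np.array([1.0, -1.0]))
print(matching, orthogonal)
```

Averaging over frames before comparing means the score reflects clip-level agreement between what is shown and what is heard, not frame-by-frame synchronization.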

On the main comparison table, the object-aware variants improve most conventional audio metrics and the proposed correspondence metric, although IS decreases relative to Diff-Foley. Diff-Foley reports FD 59.00, FAD 6.60, IS 5.67, KL 3.95, KID 0.015, and CAV 2.22. The MVE-only variant reports FD 49.48, FAD 4.04, IS 5.20, KL 3.01, KID 0.012, and CAV 2.55. The MVE+CLIP variant reports FD 48.78, FAD 5.02, IS 4.49, KL 2.82, KID 0.010, and CAV 2.67 (Liang et al., 7 Jul 2025). The paper interprets these results as evidence that object-aware conditioning improves both fidelity and object-specific correspondence.
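A small script makes the direction of each change explicit. The numbers are the ones quoted above; the helper name is hypothetical. FD, FAD, KL, and KID are treated as lower-is-better, and IS and CAV as higher-is-better.

```python
LOWER_IS_BETTER = {"FD", "FAD", "KL", "KID"}  # IS and CAV are higher-is-better

baseline = {"FD": 59.00, "FAD": 6.60, "IS": 5.67, "KL": 3.95, "KID": 0.015, "CAV": 2.22}
variants = {
    "MVE": {"FD": 49.48, "FAD": 4.04, "IS": 5.20, "KL": 3.01, "KID": 0.012, "CAV": 2.55},
    "MVE+CLIP": {"FD": 48.78, "FAD": 5.02, "IS": 4.49, "KL": 2.82, "KID": 0.010, "CAV": 2.67},
}

def improved_metrics(baseline, candidate):
    """Return the metric names on which `candidate` beats `baseline`."""
    better = []
    for name, base_value in baseline.items():
        value = candidate[name]
        if name in LOWER_IS_BETTER:
            if value < base_value:
                better.append(name)
        elif value > base_value:
            better.append(name)
    return better

for variant_name, metrics in variants.items():
    print(variant_name, improved_metrics(baseline, metrics))
# MVE ['FD', 'FAD', 'KL', 'KID', 'CAV']
# MVE+CLIP ['FD', 'FAD', 'KL', 'KID', 'CAV']
```

Both variants beat Diff-Foley on every listed metric except IS.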

The feature ablation is particularly revealing. Using CLIP features yields FD 42.78, FAD 5.73, IS 5.39, KL 3.12, KID 0.015, and CAV 2.55; CAVP features yield FD 47.38, FAD 6.73, IS 6.65, KL 4.14, KID 0.016, and CAV 1.95; MVE yields FD 38.89, FAD 4.03, IS 5.79, KL 3.17, KID 0.012, and CAV 3.06; and MVE+CLIP yields FD 35.41, FAD 4.90, IS 5.93, KL 2.90, KID 0.011, and CAV 2.69 (Liang et al., 7 Jul 2025). The paper’s own summary is that MVE gives the strongest object-conditioned representation, while CLIP contributes high-level semantics.

The augmentation ablation shows that MLM helps on most metrics, whereas RVS is aimed at harder multi-object cases and can harm some scalar metrics. Without either augmentation, the model reports FD 41.29, FAD 5.22, IS 5.86, KL 3.36, KID 0.012, and CAV 2.28. Adding MLM gives FD 34.67, FAD 4.81, IS 5.53, KL 3.07, KID 0.010, and CAV 2.47. Adding RVS alone gives FD 50.19, FAD 4.87, IS 5.56, KL 3.62, KID 0.018, and CAV 2.30. Using both gives FD 40.11, FAD 4.81, IS 5.37, KL 3.16, KID 0.014, and CAV 2.35 (Liang et al., 7 Jul 2025). Qualitatively, the paper reports sharper dog barks, cattle sounds that diminish as the animals move away, and improved attachment of sound to the selected object rather than the entire scene.

6. Relation to adjacent methods, scope, and limitations

OCAV belongs to a broader family of methods that seek finer-grained audio-visual alignment, but the literature shows multiple distinct routes to that goal. DETECLAP injects pseudo object labels derived from Microsoft CLAP and YOLOv8x-oiv7 into CAV-MAE through auxiliary binary cross-entropy losses; it improves retrieval and classification, but the paper explicitly states that it is not a new object-aware contrastive loss and is best understood as an auxiliary-objective enhancement to contrastive audio-visual masked autoencoding (Nakada et al., 2024). CAV-MAE Sync makes alignment finer in time by treating audio as a temporal sequence aligned with video frames, introducing dedicated global tokens and register tokens; it improves retrieval, classification, and localization, but still does not provide explicit object-level contrastive supervision (Araujo et al., 2 May 2025).

Other works move closer to region- or source-aware learning without becoming object-aware in the strict sense. FNAC addresses false negatives in self-supervised audio-visual source localization by using intra-modal adjacency and localized sounding regions as latent object proxies; its focus is source-aware and region-aware contrastive regularization rather than explicit object proposals (Sun et al., 2023). "Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion" aligns event and object prototypes derived from pretrained AST and ConvNeXt, but the alignment is class-level and pseudo-label based rather than spatially localized (Hou et al., 2022). "Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering" uses detected object features and adaptive contrastive objectives over question-object and audio-object pairs, yet the method is task-specific and supervised for AVQA rather than a general representation-learning framework (Li et al., 2023).

Among the supplied papers, "Object-aware Sound Source Localization via Audio-Visual Scene Understanding" is the closest direct counterpart to an OCAV-style formulation. It adds Object-aware Contrastive Alignment and Object Region Isolation losses, with foreground/background semantics generated by an MLLM and no object boxes or segmentation labels (Um et al., 23 Jun 2025). Hear-Your-Click differs in downstream objective and in the source of object supervision: its object awareness is mediated by a user- or prompt-derived binary mask sequence and serves interactive video-to-audio generation rather than source localization.

The limitations reported or implied for Hear-Your-Click are specific. Performance depends on mask quality; RVS may distort geometry and slightly hurt some metrics; the approach is tailored to cases where a visually segmentable sound source exists; and it does not fully solve dense polyphonic audio or off-screen sound causality (Liang et al., 7 Jul 2025). A plausible implication is that OCAV is most effective when object identity, object visibility, and sound causality are tightly coupled. Where these conditions fail, object-aware fine-tuning may still improve alignment, but the mask-conditioned formulation alone may not be sufficient.

Taken together, the literature positions OCAV as a precise response to a persistent weakness of clip-level audio-visual learning: broad correspondence is not enough when the task requires fine-grained semantic control. In Hear-Your-Click, that response takes the concrete form of mask-grounded contrastive fine-tuning for controllable generation (Liang et al., 7 Jul 2025). In the surrounding literature, cognate methods replace masks with pseudo object labels, spatial viewpoints, localized sounding regions, class prototypes, detected objects, or MLLM-derived foreground/background descriptions. The common theme is the same: audio-visual learning becomes more effective when the supervision signal is forced to resolve not just whether audio and video match, but which object, region, or source is responsible.