Hear-Your-Click: Click-Driven V2A Generation

Updated 4 July 2026
  • Hear-Your-Click is an interactive video-to-audio framework that uses user clicks for precise object segmentation and sound generation.
  • It leverages Object-aware Contrastive Audio-Visual Fine-tuning and a Mask-guided Visual Encoder to align object-specific features with the generated audio.
  • The framework generates audio with a latent diffusion model and reports quantitative improvements in fidelity and object correspondence over global scene methods.

Hear-Your-Click is an interactive video-to-audio (V2A) generation framework that reformulates V2A from global scene conditioning into object-conditioned sound generation: a user clicks on a frame, the selected object is segmented and tracked, and the system generates audio for that object rather than for the scene as a whole (Liang et al., 7 Jul 2025). The framework combines Object-aware Contrastive Audio-Visual Fine-tuning (OCAV), a Mask-guided Visual Encoder (MVE), object-sensitive augmentations, and a Latent Diffusion Model (LDM), and is trained on VGG-AnimSeg, a dataset derived from VGGSound for object-level audio-visual learning (Liang et al., 7 Jul 2025).

1. Task definition and reformulation of video-to-audio generation

Hear-Your-Click begins from the observation that conventional V2A methods condition on full-frame video features and therefore struggle in multi-object or visually cluttered scenes. In that regime, the generated audio may be semantically plausible at the scene level while remaining weakly tied to the object a user actually wants to sonify. Hear-Your-Click addresses this by changing the task from global scene-to-sound generation to a masked, object-aware mapping (Liang et al., 7 Jul 2025).

The paper defines the video input as

$$\mathcal{V}\in\mathbb{R}^{T\times H\times W\times 3},$$

the target object through a sequence of binary masks

$$\mathcal{M}=\{\mathcal{M}_1,\mathcal{M}_2,\ldots,\mathcal{M}_T\}, \qquad \mathcal{M}_t\in\{0,1\}^{H\times W},$$

and the output as a Mel spectrogram

$$\mathcal{A}\in\mathbb{R}^{T' \times N}.$$

This yields an explicit object-conditioned formulation from $(\mathcal{V},\mathcal{M})$ to $\mathcal{A}$ (Liang et al., 7 Jul 2025).

A central distinction in the paper is between training and inference. During training, the object is specified through human-labeled textual prompts. During inference, it is specified by user clicks. This makes Hear-Your-Click both a generative model and an interaction model: the click supplies the target identity operationally, while the masks provide the object-level conditioning needed for generation (Liang et al., 7 Jul 2025).

The practical motivation is object specificity. The intended use cases include film production, editing, accessibility, and interactive media, where one may want the barking dog rather than ambient scene sound, or the hoofbeats of cows in a selected region rather than the aggregate audio implied by the whole frame. This suggests that Hear-Your-Click is best understood as a user-steerable V2A system rather than a generic scene sonification model.

2. Interactive inference pipeline and click-driven control

At inference time, the system operates as an object-selection pipeline followed by conditional audio generation. The user uploads a silent video, selects a single frame, and clicks on the desired object or region. Segment Anything Model (SAM) produces an initial mask from the click, and the user can refine it interactively. Track Anything Model (TAM) then propagates that mask through the full sequence, yielding the mask set $\mathcal{M}$ over time (Liang et al., 7 Jul 2025).

Once the mask sequence is available, Hear-Your-Click extracts two forms of visual conditioning. The first is an object-aware feature $\boldsymbol{x}_v$ computed with MVE. The second is a masked-frame CLIP feature $\boldsymbol{x}_v^*$. These are combined and passed to the LDM to generate the final audio track (Liang et al., 7 Jul 2025). The system therefore does not merely produce a brief synchronized sound effect; in experiments it generates an 8.2-second audio clip for each test video, while training videos are 10 seconds long (Liang et al., 7 Jul 2025).

The interaction design is deliberately minimal. A click identifies what should sound, while the model determines how that object should sound from the learned audio-visual correspondence. This differs from region-agnostic V2A conditioning, where the model must infer salience from the whole scene. It also differs from text-only prompting, because the spatial target is selected directly rather than described indirectly.

Operational details are explicit. The system uses 4 fps video resized to $224\times 224$. For segmentation, SAM is configured with an IoU threshold of $0.88$ and an NMS threshold, and TAM uses 15 voting frames. At inference, the diffusion sampler is DPM-Solver with 50 steps, used together with classifier-free guidance and classifier guidance at fixed scales (Liang et al., 7 Jul 2025).

3. Object-aware Contrastive Audio-Visual Fine-tuning and the Mask-guided Visual Encoder

The representational core of Hear-Your-Click is OCAV, which adapts contrastive audio-visual pretraining to the object level. The paper argues that standard Contrastive Audio-Visual Pre-training (CAVP) is effective for global synchronization but tends to overlook fine-grained object details. OCAV instead learns from video-mask-audio triplets $(\mathcal{V},\mathcal{M},\mathcal{A})$, so that the selected object, rather than the entire frame, is aligned with the corresponding audio (Liang et al., 7 Jul 2025).

MVE implements this object-centric encoding with a dual-branch structure. The masked-video branch encodes the video after elementwise masking with the object masks, $\mathcal{V}\odot\mathcal{M}$, and the mask branch encodes the mask sequence $\mathcal{M}$ itself. The final visual representation $\boldsymbol{x}_v$ combines the outputs of the two branches (Liang et al., 7 Jul 2025). The masked RGB stream supplies appearance and motion information for the selected object, while the raw mask stream contributes spatial-temporal extent.
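The elementwise masking that feeds the masked-video branch can be illustrated with a minimal NumPy sketch. The function name `apply_object_masks` and the toy shapes are assumptions made for illustration only; the encoders themselves are not modeled here.

```python
import numpy as np

def apply_object_masks(video: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Zero out every pixel outside the object mask in each frame.

    video: (T, H, W, 3) float array.
    masks: (T, H, W) array holding 0/1 values.
    """
    if video.shape[:3] != masks.shape:
        raise ValueError("video and masks must agree on (T, H, W)")
    # Add a trailing channel axis so the mask broadcasts over RGB.
    return video * masks[..., None].astype(video.dtype)

# Toy example: an all-ones video with a 4x4 object region in every frame.
T, H, W = 4, 8, 8
video = np.ones((T, H, W, 3), dtype=np.float32)
masks = np.zeros((T, H, W), dtype=np.uint8)
masks[:, 2:6, 2:6] = 1
masked_video = apply_object_masks(video, masks)
```

Only the pixels inside the selected region survive, which is the input the masked RGB stream would see under this reading of the text.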

On the audio side, a separate encoder maps the audio input to a sequence of embeddings (Liang et al., 7 Jul 2025). Clip-level embeddings are obtained by temporal averaging, and training uses a symmetric InfoNCE-style contrastive objective over image-to-audio and audio-to-image retrieval directions. The similarity function is cosine similarity: $\mathrm{sim}(\boldsymbol{a},\boldsymbol{b})=\boldsymbol{a}^\top\boldsymbol{b}/(\lVert\boldsymbol{a}\rVert\,\lVert\boldsymbol{b}\rVert)$. In the training setup, two components are initialized from Diff-Foley pretraining, a third is randomly initialized, one encoder is mostly frozen except for its final MLP block, and two components are kept trainable (Liang et al., 7 Jul 2025).
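A symmetric InfoNCE-style objective over temporally averaged, cosine-normalized embeddings can be sketched as follows. This is a generic illustration rather than the paper's training code: the function names, the temperature of 0.07, and the toy embeddings are all assumptions.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(visual_seq, audio_seq, temperature=0.07):
    """visual_seq: (B, Tv, D) and audio_seq: (B, Ta, D) frame-level embeddings.

    The temperature value is an assumption, not a number from the paper.
    """
    v = l2_normalize(visual_seq.mean(axis=1))  # temporal averaging -> (B, D)
    a = l2_normalize(audio_seq.mean(axis=1))
    logits = v @ a.T / temperature             # pairwise cosine similarities
    idx = np.arange(v.shape[0])                # matched pairs sit on the diagonal
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)

# Toy check: correctly paired embeddings versus deliberately mismatched ones.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 5, 16))
audio_matched = visual + 0.01 * rng.normal(size=visual.shape)
loss_matched = symmetric_info_nce(visual, audio_matched)
loss_shuffled = symmetric_info_nce(visual, audio_matched[::-1])
```

The loss for correctly paired clips should be lower than for the reversed pairing, which is the behavior a contrastive objective of this kind rewards.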

Two augmentations are added to make OCAV more object-sensitive. Random Video Stitching (RVS) creates synthetic multi-object scenes by stitching frames from two videos and overlapping the audio tracks. The stated goal is to force the model to identify individual objects rather than collapse to a single global interpretation. The paper notes that excessive RVS can cause overfitting and hurt generalization, so augmentation is applied to selected subsets rather than indiscriminately (Liang et al., 7 Jul 2025).
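One way to picture RVS is the sketch below, which places two clips side by side and averages their audio. The side-by-side layout, the equal-weight mix, and the name `random_video_stitch` are assumptions; the text only states that frames are stitched and audio tracks are overlapped.

```python
import numpy as np

def random_video_stitch(video_a, video_b, audio_a, audio_b, rng):
    """Stitch two clips into one multi-object clip and overlap their audio.

    video_*: (T, H, W, 3) arrays of equal shape.
    audio_*: (S,) waveforms of equal length.
    """
    if video_a.shape != video_b.shape or audio_a.shape != audio_b.shape:
        raise ValueError("both clips must share video and audio shapes")
    # Randomly decide which clip ends up on the left half of the frame.
    if rng.random() < 0.5:
        left, right = video_a, video_b
    else:
        left, right = video_b, video_a
    stitched = np.concatenate([left, right], axis=2)  # concatenate along width
    mixed = 0.5 * (audio_a + audio_b)                 # equal-weight audio overlap
    return stitched, mixed

rng = np.random.default_rng(0)
video_a = np.zeros((4, 8, 8, 3), dtype=np.float32)
video_b = np.ones((4, 8, 8, 3), dtype=np.float32)
audio_a = np.zeros(100, dtype=np.float32)
audio_b = 2.0 * np.ones(100, dtype=np.float32)
stitched, mixed = random_video_stitch(video_a, video_b, audio_a, audio_b, rng)
```

A real pipeline would also need to resize the stitched frames back to the model's input resolution and carry the object masks along, both of which are omitted here.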

Mask-guided Loudness Modulation (MLM) enforces a synthetic relationship between object visibility and audio loudness. For frame $t$, the unmasked-area ratio is

$$r_t=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathcal{M}_t(i,j),$$

which is normalized by the maximum over the clip: $\hat{r}_t=r_t/\max_{t'} r_{t'}$. The normalized sequence is linearly interpolated to the audio length, producing loudness scaling factors that are multiplied elementwise with the audio signal (Liang et al., 7 Jul 2025). This pushes the model to associate larger or more visible masked regions with louder audio.
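The loudness modulation described above can be sketched in a few lines of NumPy. The function name, the handling of an all-empty mask sequence, and the choice to place frames evenly over the clip duration are assumptions made for this illustration.

```python
import numpy as np

def mask_guided_loudness_modulation(masks: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Scale a waveform by the normalized per-frame object area.

    masks: (T, H, W) binary masks; audio: (S,) waveform.
    """
    num_frames = masks.shape[0]
    # Fraction of pixels covered by the object in each frame.
    ratios = masks.reshape(num_frames, -1).mean(axis=1)
    peak = ratios.max()
    if peak == 0:
        # Assumption: with no visible object, leave the audio unchanged.
        return audio.copy()
    normalized = ratios / peak
    # Linearly interpolate the frame-level sequence to the audio length.
    frame_pos = np.linspace(0.0, 1.0, num_frames)
    sample_pos = np.linspace(0.0, 1.0, audio.shape[0])
    scale = np.interp(sample_pos, frame_pos, normalized)
    return audio * scale

# Toy example: the object shrinks from the full frame to nothing over 3 frames.
masks = np.zeros((3, 4, 4), dtype=np.uint8)
masks[0] = 1          # full frame
masks[1, :2, :] = 1   # half of the frame
audio = np.ones(5)
modulated = mask_guided_loudness_modulation(masks, audio)
# modulated is [1.0, 0.75, 0.5, 0.25, 0.0]
```

In the toy example the waveform fades out as the object area shrinks, mirroring the visibility-to-loudness relationship that MLM is meant to teach.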

4. Diffusion backbone, conditioning, and dataset construction

Hear-Your-Click inherits its latent encoder and decoder from Stable Diffusion v1.4 and adapts the LDM framework to audio spectrogram generation. The object-aware conditioning combines two features: $\boldsymbol{x}_v$, which comes from MVE, and $\boldsymbol{x}_v^*$, which comes from CLIP on masked frames (Liang et al., 7 Jul 2025). The stated rationale is complementary: MVE supplies object-focused temporal features, while CLIP contributes higher-level semantics that help disambiguate objects with similar motion but different sounds.

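The target spectrogram $\mathcal{A}$ is encoded into a latent $\boldsymbol{z}_0$ by the latent encoder, and the forward diffusion process takes the standard form

$$q(\boldsymbol{z}_t\mid\boldsymbol{z}_{t-1})=\mathcal{N}\!\left(\boldsymbol{z}_t;\sqrt{1-\beta_t}\,\boldsymbol{z}_{t-1},\,\beta_t\mathbf{I}\right),$$

$$q(\boldsymbol{z}_t\mid\boldsymbol{z}_0)=\mathcal{N}\!\left(\boldsymbol{z}_t;\sqrt{\bar{\alpha}_t}\,\boldsymbol{z}_0,\,(1-\bar{\alpha}_t)\mathbf{I}\right),\qquad \bar{\alpha}_t=\prod_{s=1}^{t}(1-\beta_s).$$

Training uses the standard noise-prediction objective

$$\mathcal{L}=\mathbb{E}_{\boldsymbol{z}_0,\,t,\,\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}\left[\left\lVert\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_t,t,\boldsymbol{c})\right\rVert_2^2\right],$$

where $\boldsymbol{c}$ denotes the object-aware conditioning, and the latent decoder maps the denoised latent back to the generated spectrogram $\hat{\mathcal{A}}$ (Liang et al., 7 Jul 2025).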

The training data come from VGG-AnimSeg, a dataset derived from VGGSound that focuses on animal-related videos to obtain cleaner one-to-one object-sound relationships. The paper selects 68 classes, uses CLAP and CLIP embeddings to measure audio-text and image-text cosine similarities, and keeps the top 400 training and 40 test samples per textual description by average similarity, yielding about 30,000 samples. In the experiments section, the final dataset is reported as 27,200 training samples and 2,720 test samples; with RVS, the training size rises to 34,000 (Liang et al., 7 Jul 2025).
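The selection rule and the reported dataset sizes can be checked with a short sketch. The function `select_top_k` and the toy similarity values are hypothetical; the similarities are assumed to be precomputed by CLAP (audio-text) and CLIP (image-text), which are not called here.

```python
import numpy as np

def select_top_k(audio_text_sim, image_text_sim, k):
    """Return indices of the k samples with the highest average similarity."""
    avg = 0.5 * (np.asarray(audio_text_sim) + np.asarray(image_text_sim))
    order = np.argsort(-avg, kind="stable")  # descending by average similarity
    return order[:k]

# Toy similarities for four candidate clips of one textual description.
audio_text_sim = [0.9, 0.1, 0.5, 0.7]
image_text_sim = [0.8, 0.2, 0.9, 0.1]
kept = select_top_k(audio_text_sim, image_text_sim, k=2)  # indices 0 and 2

# Consistency check on the reported sizes, assuming one description per class.
num_classes = 68
train_size = num_classes * 400   # 27,200
test_size = num_classes * 40     # 2,720
total_size = train_size + test_size  # 29,920, i.e. about 30,000
```

Under the one-description-per-class assumption, 68 classes at 400 and 40 samples each reproduce the reported 27,200 and 2,720 figures exactly.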

Training details are split by stage. During OCAV training, the system samples 4-second synchronized video-audio segments. During LDM training, MVE features are extracted from the full 10-second videos, and CLIP embeddings are extracted per frame (Liang et al., 7 Jul 2025).

5. Quantitative performance, ablations, and empirical behavior

Hear-Your-Click is evaluated against SpecVQGAN, Im2Wav, Seeing and Hearing (SAH), and Diff-Foley using Frechet Distance (FD), Frechet Audio Distance (FAD), Inception Score (IS), Kullback-Leibler Divergence (KL), Kernel Inception Distance (KID), and the newly introduced CAV score, which measures audio-visual correspondence in a shared C-MCR representation space (Liang et al., 7 Jul 2025).
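FD and FAD are both Fréchet distances between Gaussian fits of embedding sets, differing in the feature extractor used. The sketch below shows the generic computation on arbitrary feature arrays; the function name and the random toy features are assumptions, and no specific audio embedding network is implied.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # trace((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # for covariance matrices (up to numerical error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))
fd_same = frechet_distance(reference, reference)           # approximately 0
fd_shifted = frechet_distance(reference, reference + 3.0)  # approximately 4 * 3**2 = 36
```

Shifting every feature by a constant leaves the covariance untouched, so the distance reduces to the squared mean difference, which makes the toy case easy to verify by hand.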

In the main comparison, the best Hear-Your-Click variant uses MVE+CLIP conditioning, and an MVE-only variant is reported alongside it; both are compared with Diff-Foley on FD, FAD, IS, KL, KID, and CAV (Liang et al., 7 Jul 2025). These results indicate that Hear-Your-Click improves most reported fidelity and correspondence metrics relative to the listed global-conditioning baselines, while not being best on IS.

The ablations are particularly informative. The visual-feature ablation compares conditioning the LDM on MVE alone with conditioning on MVE+CLIP. MVE+CLIP further improves several fidelity metrics, namely FD, KL, and KID, but lowers CAV (Liang et al., 7 Jul 2025). This suggests that CLIP semantics can improve generation quality while not necessarily maximizing the paper’s object-correspondence metric.

The MLM and RVS ablation isolates the object-sensitive training heuristics by comparing a base model without either augmentation against MLM alone, RVS alone, and the combined MLM+RVS setting. Adding MLM alone improves FD, FAD, KL, KID, and CAV over the base model, whereas RVS alone increases FD and KID (Liang et al., 7 Jul 2025). The paper explicitly states that RVS can hurt aggregate metrics, possibly because resizing in stitched videos distorts aspect ratios, while still improving handling of multi-object scenes qualitatively.

Qualitative examples are used to illustrate the intended behavior. For “The dog growls and suddenly barks loudly,” the reported claim is that the generated sound is close to the ground truth while preserving synchronization with transient motion. For “Herds of cattle pass by,” the generated volume diminishes as cows run away, which the paper attributes to sensitivity to temporal mask changes and distance cues (Liang et al., 7 Jul 2025).

6. Scope, limitations, and neighboring research uses of the phrase

Hear-Your-Click depends critically on mask quality. The conditioning signal is produced by SAM, TAM, and, in data preparation, DEVA; the paper therefore identifies segmentation quality as a direct limitation. Poor masks corrupt object features and degrade generation. The method also inherits ambiguity when multiple objects have similar appearance or motion, and temporal consistency is only partially handled: MLM improves amplitude trends tied to visibility, but the model still operates at relatively coarse frame rates and does not deeply model long-range sound structure beyond diffusion conditioning (Liang et al., 7 Jul 2025).

The dataset is intentionally biased toward animal-related VGGSound clips because the authors sought cleaner one-to-one sound-source relations. This improves supervision quality but limits demonstrated open-domain generality. RVS further reveals a trade-off between controllability and standard generation metrics. The paper does not report a user study, does not provide a single joint loss for OCAV and LDM, and does not discuss training cost in GPU-hours or inference latency in seconds, although it does provide operational details such as 50 DPM-Solver steps and interactive SAM/TAM segmentation (Liang et al., 7 Jul 2025).

The phrase “Hear-Your-Click” also resonates with several adjacent research lines. In spectral geometry, the “Hear-Your-Click” question corresponds to whether pointwise reverberation determines location on a manifold; for compact Riemannian surfaces without boundary, pointwise spectral data being constant across points is equivalent to transitive isometry-group action (Wang et al., 2023). In mobile security, permissionless smartphone barometer data can detect touchscreen finger taps with 100% accuracy in segmented 2-second windows and can reveal coarse information about tap position (Hafez et al., 2020). In robot manipulation, training-time button instrumentation can supervise a fingertip-microphone audio model into a click detector, reducing contact force during button pressing even when success rates remain similar across methods (Talwar et al., 7 Apr 2026). A separate click-conditioned generative line, “Follow-Your-Click,” uses a user click and a short motion prompt for open-domain regional image animation rather than V2A generation (Ma et al., 2024).

Within that broader landscape, Hear-Your-Click denotes a specific shift in multimodal generation: object-aware, click-driven V2A in which a user-selected region is segmented, tracked, aligned with sound through OCAV, and used to condition latent diffusion. Its importance lies less in generic audio realism than in changing the control interface of V2A from scene-level prompting to explicit object selection (Liang et al., 7 Jul 2025).
