Sketch2Sound: Controllable Audio Synthesis

Updated 3 July 2026

Sketch2Sound is a generative audio model that produces high-quality sounds using text prompts paired with interpretable time-varying controls such as loudness, brightness, and pitch.
It fine-tunes a pre-trained text-to-audio diffusion model with a single linear layer per control, with the controls aligned to the model's 40 Hz latent frame rate.
Quantitative evaluations show large reductions in control error and expose a trade-off between low-level control fidelity and high-level semantic adherence.

Sketch2Sound denotes a class of controllable audio-generation methods in which a sound is specified not only by a semantic prompt but also by sketchlike control signals. In its most explicit formulation, Sketch2Sound is a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals—loudness, brightness, and pitch—as well as text prompts, and can synthesize arbitrary sounds from sonic imitations such as a vocal imitation or a reference sound-shape (García et al., 2024). The term also sits within a broader research program on sound–shape associations, free-form sound sketching, and communicative sound depiction, where the central question is whether human-produced sketches or sonic gestures can function as usable, low-bandwidth control interfaces for synthesis, editing, retrieval, and separation.

1. Definition and problem setting

Sketch2Sound addresses controllable sound generation under a division of labor between two kinds of conditions. Text prompts specify the semantic identity of the sound, such as an event class, object, or environment, while time-varying controls specify how the sound should evolve over time. In the core formulation, text answers the question of what kind of sound to make, whereas loudness, brightness, and pitch describe when the sound gets louder or quieter, brighter or duller, or more or less pitched over its duration (García et al., 2024).

This design is motivated by use cases such as sound design and Foley, where producing a plausible class label is insufficient. A designer may need a whoosh that rises in pitch, an impact with a shaped attack, or an ambience whose bursts and decays follow a performed gesture. The method therefore treats sonic imitation as a sketch rather than as a waveform to be copied. A vocal imitation, tapping pattern, or other reference sound is reduced to a control sketch: a time-varying abstract description of loudness, brightness, and pitch that guides generation without requiring the output to inherit the timbre of the imitation itself (García et al., 2024).

The same conceptual split appears in adjacent work. In expressive speech synthesis, DrawSpeech introduces user-drawn prosodic sketches as high-level conditioning signals that indicate only coarse pitch and energy trends, while a learned model reconstructs detailed realizable contours before final synthesis. This suggests that sketch-conditioned generation is not limited to environmental or Foley-style sound effects, but can also structure fine-grained temporal control in speech (Chen et al., 8 Jan 2025).

2. Architecture and conditioning pathway

The canonical Sketch2Sound architecture is a lightweight adapter on top of a pre-trained text-to-audio latent diffusion transformer. The base model operates on 48 kHz mono audio, uses a VAE to compress audio into latents at 40 Hz, and represents each latent frame with dimensionality 64 (García et al., 2024). At this frame rate, each control frame corresponds to $25$ ms, so the controls operate on a temporally meaningful grid without acting directly at sample rate.

Let the noisy latent sequence be $\mathbf{z} \in \mathbb{R}^{D \times N}$, where $D$ is the latent dimensionality and $N$ the number of latent frames, and let a time-varying control be $\mathbf{c}_{\mathrm{ctrl}} \in \mathbb{R}^{K \times N}$, where $K$ is the feature dimension of that control. Sketch2Sound adds a trainable linear projection $p_\theta$ per control and conditions the model by additive fusion: $\mathbf{z}_{\mathrm{ctrl}} = \mathbf{z} + p_\theta(\mathbf{c}_{\mathrm{ctrl}})$. The method can repeat this process for any number of controls, so the full conditioned latent is the noisy latent plus the sum of projected control sequences (García et al., 2024).
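The additive fusion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the per-control feature sizes, the zero initialisation of the projections, and the names `init_projections` and `fuse` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 64, 200  # latent channels; 200 frames = 5 s at 40 Hz
# Feature size K per control. These values are illustrative assumptions.
control_dims = {"loudness": 1, "centroid": 1, "pitch": 360}

def init_projections(control_dims, latent_dim):
    """One linear map (weight + bias) per control, zero-initialised (an assumption)."""
    return {
        name: {"W": np.zeros((latent_dim, k)), "b": np.zeros((latent_dim, 1))}
        for name, k in control_dims.items()
    }

def fuse(z, controls, projections):
    """Return z + sum over controls of (W @ c + b), computed for every frame."""
    z_ctrl = z.copy()
    for name, c in controls.items():
        p = projections[name]
        z_ctrl += p["W"] @ c + p["b"]  # (D, K) @ (K, N) -> (D, N)
    return z_ctrl

projections = init_projections(control_dims, D)
z = rng.standard_normal((D, N))
controls = {name: rng.standard_normal((k, N)) for name, k in control_dims.items()}
z_ctrl = fuse(z, controls, projections)
print(z_ctrl.shape)  # (64, 200)
```

With zero-initialised projections the fused latent equals the input latent, so under this assumption the adapter starts as a no-op and the pre-trained model's behaviour is unchanged at the first fine-tuning step.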

This design is deliberately minimal. The model requires only a single linear layer per control signal and approximately 40k steps of fine-tuning, and is presented as more lightweight than existing methods like ControlNet. No auxiliary reconstruction loss is imposed on the control curves themselves; the training loss remains the same denoising diffusion objective used for the original text-to-audio model (García et al., 2024).

Temporal alignment is explicit. Because the VAE latents operate at 40 Hz, the control signals must be extracted at the same frame rate or interpolated to that rate. The conditioning pathway therefore preserves framewise temporal structure instead of tokenizing controls into a separate symbolic stream (García et al., 2024).
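As a rough illustration of the alignment requirement, the sketch below maps a control curve extracted at some other rate onto the 40 Hz latent grid by linear interpolation. The 100 Hz source rate, the function name `resample_control`, and the choice of linear interpolation are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 48_000   # audio sample rate of the base model
LATENT_RATE = 40       # latent (and control) frames per second
SAMPLES_PER_FRAME = SAMPLE_RATE // LATENT_RATE  # 1200 samples = 25 ms

def resample_control(values, src_rate, num_latent_frames, dst_rate=LATENT_RATE):
    """Linearly interpolate a 1-D control curve onto the latent frame grid."""
    values = np.asarray(values, dtype=float)
    src_times = np.arange(len(values)) / src_rate
    dst_times = np.arange(num_latent_frames) / dst_rate
    return np.interp(dst_times, src_times, values)

# A hypothetical loudness curve at 100 Hz covering 5 s, mapped to 200 latent frames.
curve_100hz = np.linspace(-60.0, 0.0, 500)
curve_40hz = resample_control(curve_100hz, src_rate=100, num_latent_frames=200)
print(SAMPLES_PER_FRAME, len(curve_40hz))  # 1200 200
```

Extracting the controls directly with a hop of 1200 samples would avoid the interpolation step altogether.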

3. Control signals, sonic imitations, and sketch abstraction

Sketch2Sound uses three interpretable control families. Loudness is extracted by computing a magnitude spectrogram, applying an A-weighted sum across frequency bins, and taking the RMS of the result. Brightness is represented by spectral centroid, then converted from linear frequency into a continuous MIDI-like representation and divided by $127$, which was reported to stabilize early training. Pitch is represented not by a single scalar estimate but by the raw pitch probabilities from the CREPE tiny model; probabilities below $0.1$ are zeroed out to suppress low-confidence structure (García et al., 2024).
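The loudness and brightness extraction described above can be approximated with plain NumPy. The sketch below is one possible reading of the description, not the paper's code: the frame length, the Hann window, applying the A-weighting to the power spectrum, and all function names are assumptions. CREPE itself is not called; only the thresholding step applied to its probabilities is shown.

```python
import numpy as np

def a_weighting_db(freqs_hz):
    """Standard analytic A-weighting curve in dB (about 0 dB at 1 kHz)."""
    f2 = np.asarray(freqs_hz, dtype=float) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20.0 * np.log10(np.maximum(num / den, 1e-12)) + 2.0

def stft_magnitude(audio, frame_len, hop):
    """Magnitude spectrogram with centred, Hann-windowed frames: (n_frames, n_bins)."""
    padded = np.pad(np.asarray(audio, dtype=float), frame_len // 2)
    n_frames = 1 + (len(padded) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.abs(np.fft.rfft(padded[idx] * np.hanning(frame_len), axis=1))

def extract_controls(audio, sr=48_000, hop=1200, frame_len=2400):
    """Return (loudness in dB, brightness as MIDI/127), one value per 25 ms frame."""
    mag = stft_magnitude(audio, frame_len, hop)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)

    # Loudness: A-weight the power spectrum, take the RMS over bins, express in dB.
    gain = 10.0 ** (a_weighting_db(freqs) / 10.0)
    mean_power = ((mag ** 2) * gain).mean(axis=1)
    loudness_db = 10.0 * np.log10(np.maximum(mean_power, 1e-10))

    # Brightness: spectral centroid in Hz -> fractional MIDI number -> divide by 127.
    centroid_hz = (mag * freqs).sum(axis=1) / np.maximum(mag.sum(axis=1), 1e-10)
    centroid_midi = 69.0 + 12.0 * np.log2(np.maximum(centroid_hz, 1.0) / 440.0)
    return loudness_db, centroid_midi / 127.0

def threshold_pitch_probs(probs, floor=0.1):
    """Zero out pitch probabilities below `floor`; keep the rest unchanged."""
    probs = np.asarray(probs, dtype=float)
    return np.where(probs < floor, 0.0, probs)

# Usage on one second of a 1 kHz test tone.
t = np.arange(48_000) / 48_000
loudness_db, brightness = extract_controls(0.5 * np.sin(2 * np.pi * 1000 * t))
```

For a steady 1 kHz tone the brightness value away from the edges is close to 0.655, since 1 kHz corresponds to a MIDI number of about 83.2.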

These controls can be obtained from an input sonic imitation. A user records a vocal imitation or other sound-shape; the system extracts loudness, brightness, and pitch probability contours; and those contours become the conditioning sketch used together with text. The generated output is expected to follow the gesture of the imitation while sounding like the text-described source rather than like the input voice (García et al., 2024).

A central innovation is the treatment of imitations as sketches rather than exact targets. During training, Sketch2Sound applies random median filters to the control signals, with window sizes sampled from $1$ to $25$ control frames. Since one frame equals $25$ ms, the effective window range is $25$ ms to $625$ ms. At inference, the paper studies a range of filter sizes, so the filter width gives explicit control over the trade-off between temporal specificity and plausibility (García et al., 2024).
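A minimal sketch of the random median filtering, assuming edge padding at the boundaries and a uniform draw of the window size (neither detail is specified above; `median_filter` and `random_median_filter` are names introduced here):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

FRAME_MS = 25  # duration of one control frame at 40 Hz

def median_filter(curve, window):
    """Sliding median along the last axis; output length equals input length."""
    curve = np.asarray(curve, dtype=float)
    if window <= 1:
        return curve.copy()
    left = (window - 1) // 2
    right = window - 1 - left
    pad = [(0, 0)] * (curve.ndim - 1) + [(left, right)]
    padded = np.pad(curve, pad, mode="edge")  # repeat edge values (an assumption)
    windows = sliding_window_view(padded, window, axis=-1)
    return np.median(windows, axis=-1)

def random_median_filter(curve, rng, min_window=1, max_window=25):
    """Draw a window size in [min_window, max_window] frames and filter the curve."""
    window = int(rng.integers(min_window, max_window + 1))  # upper bound inclusive
    return median_filter(curve, window), window

rng = np.random.default_rng(0)
curve = np.zeros(40)
curve[20] = 1.0  # a single-frame spike in an otherwise flat control
smoothed, window = random_median_filter(curve, rng)
print(f"window: {window} frames = {window * FRAME_MS} ms")
```

A median filter removes isolated spikes shorter than half the window while keeping step edges sharp, which is one reason it suits a sketch-like abstraction better than simple averaging.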

Conditioning dropout is also used to permit flexible input subsets. During training, each control is dropped independently with a fixed probability, text conditioning is dropped with its own probability, and all signals are dropped together with a further probability. Inference uses two-condition classifier-free guidance with separate scales, a text guidance scale and a control guidance scale shared across all controls, for which the paper recommends default values (García et al., 2024).
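The dropout scheme and a two-scale guidance rule can be sketched as follows. Everything numeric here is a placeholder: the three probabilities are hypothetical values chosen for illustration, not the paper's settings, and the nested guidance formula in `dual_cfg` is one common way to combine two conditions, assumed here rather than taken from the source.

```python
import numpy as np

# Hypothetical probabilities for illustration only (not the paper's values).
P_DROP_CONTROL = 0.2   # each control, independently
P_DROP_TEXT = 0.1      # text prompt
P_DROP_ALL = 0.1       # every condition at once

def sample_condition_mask(control_names, rng):
    """Return {name: keep?} for every control plus the 'text' condition."""
    names = list(control_names) + ["text"]
    if rng.random() < P_DROP_ALL:
        return {name: False for name in names}
    keep = {name: bool(rng.random() >= P_DROP_CONTROL) for name in control_names}
    keep["text"] = bool(rng.random() >= P_DROP_TEXT)
    return keep

def dual_cfg(eps_uncond, eps_ctrl, eps_full, s_ctrl, s_text):
    """Nested two-condition guidance (an assumed formulation).

    eps_uncond: prediction with no conditioning
    eps_ctrl:   prediction with controls only
    eps_full:   prediction with controls and text
    """
    return (eps_uncond
            + s_ctrl * (eps_ctrl - eps_uncond)
            + s_text * (eps_full - eps_ctrl))

rng = np.random.default_rng(0)
mask = sample_condition_mask(["loudness", "centroid", "pitch"], rng)
```

With both scales set to 1 the rule returns the fully conditioned prediction, and with both set to 0 it returns the unconditional one, so the two scales move the output between those extremes and beyond.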

4. Training protocol and quantitative evaluation

The base text-to-audio model is pre-trained on proprietary licensed sound-effect datasets and publicly available CC-licensed general audio datasets. Sketch2Sound itself is then initialized from the text-only baseline checkpoint and fine-tuned for 40k steps. Evaluation centers on VimSketch, a dataset of approximately 12k vocal imitations, each paired with a text description and a reference sound. For each model variant, the paper generates 10k examples up to 5 seconds long using vocal imitation plus text description as conditioning (García et al., 2024).

The evaluation protocol separates three axes. Text adherence is measured by CLAP audio-text cosine similarity. Audio quality is measured by Fréchet Audio Distance with VGGish and CLAP embeddings, using a proprietary reference set of 40k high-quality sound effects. Control adherence is measured as the error between input and generated controls on valid frames only: loudness in dB, centroid in semitones, pitch in semitones, chroma in semitones, and periodicity (García et al., 2024).
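Two of these measurements are simple enough to sketch directly. The code below assumes a mean absolute error for control adherence and an octave-wrapped distance for chroma; both choices, and the function names, are assumptions for illustration, and the embedding vectors passed to `cosine_similarity` would come from a CLAP model that is not shown.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def masked_control_error(target, generated, valid):
    """Mean absolute difference over frames where `valid` is True."""
    target = np.asarray(target, dtype=float)
    generated = np.asarray(generated, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return float("nan")
    return float(np.mean(np.abs(target[valid] - generated[valid])))

def chroma_error(target_semitones, generated_semitones, valid):
    """Pitch error ignoring octave: distance on a 12-semitone circle (0 to 6)."""
    diff = np.abs(np.asarray(target_semitones, dtype=float)
                  - np.asarray(generated_semitones, dtype=float)) % 12.0
    wrapped = np.minimum(diff, 12.0 - diff)
    return masked_control_error(wrapped, np.zeros_like(wrapped), valid)

# The third frame is marked invalid (for example, unvoiced) and is ignored.
print(masked_control_error([0.0, 2.0, 10.0], [1.0, 2.0, 0.0], [True, True, False]))  # 0.5
```

Under the wrapped definition, an output that is exactly one octave above the target has a pitch error of 12 semitones but a chroma error of 0.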

Variant	Control adherence (error; loudness in dB, others in semitones)	Semantics and quality
Text-only baseline	loudness 13.41, centroid 10.34, pitch 13.91, chroma 2.96	CLAP 0.273, VGGish FAD 2.57, CLAP FAD 0.27
Loudness + centroid + pitch (full model, median filters)	loudness 3.60, centroid 4.43, pitch 1.49, chroma 0.48	CLAP 0.211, VGGish FAD 2.51, CLAP FAD 0.312
Full model with no filters	loudness 1.87, centroid 3.21, pitch 0.45, chroma 0.21	CLAP 0.152, VGGish FAD 3.53, CLAP FAD 0.379

The incremental ablation confirms that each added control improves adherence to the corresponding temporal structure. Loudness-only conditioning reduces loudness error relative to the text-only baseline's $13.41$ dB; adding centroid conditioning reduces centroid error in turn; and the full model reduces pitch error to $1.49$ semitones and chroma error to $0.48$ semitones (García et al., 2024).

5. Controllability, trade-offs, and limitations

The central empirical trade-off is between low-level control fidelity and high-level plausibility. Relative to the text-only baseline, the full model reduces loudness error from $13.41$ to $3.60$, centroid error from $10.34$ to $4.43$, pitch error from $13.91$ to $1.49$, and chroma error from $2.96$ to $0.48$, while text adherence decreases from $0.273$ to $0.211$ (García et al., 2024). Audio quality remains comparable: VGGish FAD slightly improves from $2.57$ to $2.51$, while CLAP FAD worsens modestly from $0.27$ to $0.312$.
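To put the trade-off in relative terms, the short calculation below uses the figures from the results table, taking the "Loudness + centroid + pitch" row as the full model (an interpretation of the table, not a statement from the paper). The helper `relative_reduction` is introduced here for illustration.

```python
# Error values copied from the results table (text-only baseline vs. full model).
baseline = {"loudness": 13.41, "centroid": 10.34, "pitch": 13.91, "chroma": 2.96}
full = {"loudness": 3.60, "centroid": 4.43, "pitch": 1.49, "chroma": 0.48}

def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

reductions = {name: relative_reduction(baseline[name], full[name]) for name in baseline}
clap_drop = relative_reduction(0.273, 0.211)  # loss in CLAP text adherence

for name, value in reductions.items():
    print(f"{name:9s} error reduced by {value:.1f}%")
print(f"CLAP text adherence reduced by {clap_drop:.1f}%")
```

The control errors fall by roughly 73%, 57%, 89%, and 84% for loudness, centroid, pitch, and chroma respectively, against a drop of about 23% in CLAP text adherence.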

The filter ablation is especially informative. Low-pass filtering and no filtering enforce controls more strictly than median filtering, but they degrade text adherence and FAD. The full model with median filtering obtains CLAP text adherence $0.211$, VGGish FAD $2.51$, and CLAP FAD $0.312$, whereas no filtering yields $0.152$, $3.53$, and $0.379$; low-pass filtering likewise trails median filtering on these measures (García et al., 2024). This supports the thesis that vocal imitation is often better treated as a sketch than as an exact target trajectory.

Qualitative examples reinforce that the controls can structure semantics rather than merely acoustics. With the prompt “forest ambience,” bursts of loudness in the control can induce bird sounds even without explicit mention of birds. With “bass drum, snare drum,” pitched regions can become bass drum and unpitched regions snare drum. This suggests that the model learns correlations between temporal control patterns and event structure from data (García et al., 2024).

Several limitations are explicit or strongly implied. Strict control matching can make outputs speech-like. Better vocal imitations produce better outputs. Centroid control can carry over room tone or recording environment information. Controls operate at $40$ Hz rather than sample rate, examples are limited to 5 seconds, and pitch reliability depends on CREPE periodicity estimates (García et al., 2024). A plausible implication is that Sketch2Sound is best suited to gestural, event-shaped audio rather than to domains requiring exact microstructure or long-form coherence.

The broader Sketch2Sound idea rests on older evidence that free-form sketching carries some reproducible sonic information. “Sketching sounds: an exploratory study on sound-shape associations” reports that the development of a synthesiser exploiting sound-shape associations is feasible, with acute angles, intersections, and number of strokes emerging as especially informative sketch features for timbral attributes such as hardness, roughness, sharpness, warmth, and depth (Löbbers et al., 2021). “Seeing Sounds, Hearing Shapes” then shows that participants recognized sound–sketch matches above the random baseline, while also concluding that it is difficult to visually encode nuanced timbral differences of similar sounds (Löbbers et al., 2022).

A second strand replaces visual sketching with vocal depiction. “Sketching With Your Voice: ‘Non-Phonorealistic’ Rendering of Sounds via Vocal Imitation” treats vocal imitation as the auditory analogue of sketching and formalizes it with a pair of speaker and listener models. Its main empirical result is that communicative reasoning improves alignment with human vocal imitation strategies beyond direct feature matching (Caren et al., 2024). This line is conceptually close to Sketch2Sound’s use of sonic imitations as high-level control rather than literal reproduction.

The same sketch-conditioned logic appears in speech. DrawSpeech uses user-drawn pitch and energy sketches, recovers detailed phoneme-level contours, and conditions a latent diffusion acoustic model so that coarse sketches become realizable expressive speech. This suggests that the coarse-to-fine pattern of sketch extraction, structured recovery, and diffusion-based final generation is portable across audio domains (Chen et al., 8 Jan 2025).

Sketch2Sound has also been reused downstream. PromptSep incorporates Sketch2Sound as a data augmentation strategy for generative audio separation, using around 12K real vocal imitation samples from VimSketch to generate around 87K temporally aligned sound effect samples, released as the 87,171-pair VimSketchGen dataset. In that setting, raw vocal imitation conditioning on VimSketchGen-Mix is evaluated on SDRi, L2 Mel, F1 Decision Error, CLAPScore, and FAD, and it outperforms an ablation that conditions only on pitch and RMS curves (Wen et al., 6 Nov 2025). This suggests that Sketch2Sound functions not only as an interactive interface but also as a practical mechanism for building aligned multimodal training data.

Taken together, these works place Sketch2Sound at the intersection of controllable generation, sonic imitation, and cross-modal sketch research. Its distinctive contribution is to make sketchlike temporal control operational inside a text-to-audio diffusion model with minimal additional machinery, while the surrounding literature indicates both why such control is plausible and where its limits remain.