
Schrödinger Audio-Visual Editor (SAVE)

Updated 18 March 2026
  • SAVE is a quantum-inspired framework that combines principles of quantum mechanics, machine learning, and signal processing to enable innovative audio-visual editing.
  • It employs the time-dependent Schrödinger equation and Schrödinger Bridge theory to facilitate real-time audio synthesis and object-level audiovisual removal with high fidelity.
  • The system delivers synchronized audiovisual outputs with state-of-the-art performance metrics, offering promising applications in media postproduction and interactive editing.

The Schrödinger Audio-Visual Editor (SAVE) designates two distinct, contemporaneous research systems, each leveraging principles from quantum mechanics, machine learning, and signal processing for advanced audio-visual modeling and editing. The first SAVE system (Freye et al., 2024) turns the numerical solution of the time-dependent Schrödinger equation into a real-time audio-visual instrument. The second SAVE system (Xu et al., 14 Dec 2025) introduces a unified computational paradigm for object-level audiovisual removal via end-to-end flow matching in a shared latent space, with broad implications for content editing and synthesis. Both share the themes of synchronous audio-visual processing and grounding in physics-inspired mathematical formalism.

1. Mathematical and Physical Foundations

At the core of the original SAVE synthesizer lies the time-dependent Schrödinger equation in one spatial dimension (with units set such that ℏ = m = 1):

$$i\,\frac{\partial\psi(x,t)}{\partial t} = \left[-\frac{1}{2}\frac{\partial^2}{\partial x^2} + V(x)\right]\psi(x,t)$$

The equation's spatial domain $x \in [x_0, x_0+L]$ is discretized into $N$ grid points, with the wavefunction $\psi_j^t \approx \psi(x_j, t)$ stored as a complex array. Numerical evolution employs the split-step (Fourier) method, alternately propagating in potential ($V(x)$) and kinetic (momentum) spaces via

$$\psi^{(1)}(x_j) = e^{-i V(x_j)\,\Delta t}\,\psi(x_j, t)$$

$$\psi(x_j, t+\Delta t) = \mathcal{F}^{-1}\big\{ e^{-i \frac{k^2}{2}\Delta t}\,\mathcal{F}\{\psi^{(1)}(x_j)\} \big\}$$

This unconditionally stable algorithm, with accuracy governed by $\Delta t \lesssim C\,\Delta x^2$, establishes a dynamic, physics-valid state space for subsequent mapping to audiovisual signals (Freye et al., 2024).
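The split-step update above can be sketched in a few lines of NumPy. This is a minimal illustration of the numerical scheme, not the plugin's C++ implementation; grid size, potential, and initial wavepacket are illustrative choices.

```python
import numpy as np

def split_step(psi, V, dx, dt):
    """One split-step Fourier update of the 1-D Schrodinger equation
    (hbar = m = 1): propagate in the potential, then apply the kinetic
    propagator in momentum space, exactly as in the two equations above."""
    N = psi.size
    k = 2 * np.pi * np.fft.fftfreq(N, d=dx)   # momentum grid
    psi = np.exp(-1j * V * dt) * psi          # potential propagator
    psi_k = np.fft.fft(psi)
    psi_k *= np.exp(-1j * (k**2 / 2) * dt)    # kinetic propagator
    return np.fft.ifft(psi_k)

# Gaussian wavepacket evolving in a harmonic potential (illustrative setup)
N, L = 256, 20.0
x = np.linspace(-L/2, L/2, N, endpoint=False)
dx = x[1] - x[0]
V = 0.5 * x**2
psi = np.exp(-(x - 2.0)**2) * np.exp(1j * x)
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)   # normalize

for _ in range(100):
    psi = split_step(psi, V, dx, dt=0.01)

# each step is unitary, so the norm is conserved
print(np.sum(np.abs(psi)**2) * dx)  # ≈ 1.0
```

Because both propagators are pure phase factors, the update preserves the wavefunction's norm to machine precision, which is what makes the scheme unconditionally stable.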

In the context of audiovisual removal (Xu et al., 14 Dec 2025), SAVE invokes the Schrödinger Bridge formalism for learning a stochastic process that directly matches source to target latent state distributions $(X_0 \to X_1)$ in a shared audio-visual space. The conditional flows are expressed for Gaussian marginals as

$$u_t(X_t \mid X_0, X_1) = (X_1 - X_0) + \frac{1-2t}{2t(1-t)}\,(X_t - \mu_t), \qquad \mu_t = (1-t)X_0 + tX_1$$

This deterministic transport is learned via regression within a diffusion transformer (DiT) backbone, optimizing for:

$$\mathcal{L}_{\mathrm{SAVE}} = \mathbb{E}_{t, X_0, X_1, X_t}\Big[ \big\|v_t^{a}(X_t, y) - u_t^a(X_t^a \mid X_0^a, X_1^a)\big\|^2 + \lambda\,\big\|v_t^{v}(X_t, y) - u_t^v(X_t^v \mid X_0^v, X_1^v)\big\|^2 \Big]$$
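The regression target $u_t$ can be computed in closed form once $X_0$, $X_1$, and a bridge sample $X_t$ are drawn. The sketch below assumes a Brownian-bridge marginal with a hypothetical noise scale `sigma` (the paper's exact parameterization is not specified in the text above):

```python
import numpy as np

def conditional_flow_target(x0, x1, t, rng, sigma=0.1):
    """Sample X_t on the Gaussian-marginal bridge between x0 and x1 and
    return the conditional target velocity u_t(X_t | X_0, X_1) from the
    text. sigma is an assumed bridge-noise scale."""
    mu_t = (1 - t) * x0 + t * x1
    std_t = sigma * np.sqrt(t * (1 - t))          # bridge marginal std
    x_t = mu_t + std_t * rng.standard_normal(x0.shape)
    u_t = (x1 - x0) + (1 - 2 * t) / (2 * t * (1 - t)) * (x_t - mu_t)
    return x_t, u_t

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(8), rng.standard_normal(8)
x_t, u_t = conditional_flow_target(x0, x1, t=0.5, rng=rng)
# at t = 0.5 the (1 - 2t) correction vanishes, leaving the straight-line drift
print(np.allclose(u_t, x1 - x0))  # True
```

Training then reduces to regressing the network's predicted velocities onto `u_t` for the audio and video halves of the latent, weighted by $\lambda$.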

2. Audiovisual Synthesis and Editing Mechanisms

The first SAVE system interprets the time-evolved quantum probability density $A_j = |\psi_j^t|^2$ as a wavetable for audio synthesis. For a MIDI note frequency $f_n$ and sample rate $f_s$, this array is resampled and looped as a single period, with each MIDI note activating an independent voice. Polyphony and human-music workflow are achieved via per-voice state, ADSR envelopes, and real-time parameter mapping.
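The wavetable read-out can be sketched with a fractional phase accumulator: one period of the table spans $f_s / f_n$ output samples. This is a minimal stand-in for the plugin's resampler, not its actual code; the interpolation scheme and DC handling are assumptions.

```python
import numpy as np

def wavetable_voice(psi, f_note, f_s, n_samples):
    """Loop the probability density |psi|^2 as a single-cycle wavetable,
    read with linear interpolation at the rate implied by the note
    frequency (a minimal sketch; the plugin's resampler may differ)."""
    table = np.abs(psi)**2
    table = table - table.mean()           # remove DC offset (assumption)
    N = table.size
    phase_inc = f_note * N / f_s           # table indices advanced per sample
    idx = (np.arange(n_samples) * phase_inc) % N
    i0 = idx.astype(int)
    i1 = (i0 + 1) % N                      # wrap for the looped period
    frac = idx - i0
    return (1 - frac) * table[i0] + frac * table[i1]

# illustrative Gaussian density rendered as one second-tenth of A3 at 48 kHz
psi = np.exp(-np.linspace(-4, 4, 512)**2 / 2).astype(complex)
audio = wavetable_voice(psi, f_note=220.0, f_s=48000.0, n_samples=4800)
print(audio.shape)  # (4800,)
```

Each active MIDI note would hold its own copy of this phase state, so the evolving $|\psi|^2$ of the shared simulation continuously reshapes the timbre of every voice.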

Stereo imaging is accomplished by applying smooth weighting functions $f_L(x_j)$, $f_R(x_j)$ (with $f_L + f_R = 1$) to the probability density along the spatial grid, yielding left/right channels that encode the spatial features of $|\psi|^2$.
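Any complementary pair of smooth curves satisfies the $f_L + f_R = 1$ constraint; a raised-cosine crossfade is one natural choice (the plugin's exact weighting curve is not specified, so this is an assumed shape):

```python
import numpy as np

def stereo_weights(x):
    """Smooth complementary pan weights over the spatial grid:
    f_L + f_R = 1 at every grid point (raised-cosine crossfade,
    an assumed curve rather than the plugin's documented one)."""
    u = (x - x.min()) / (x.max() - x.min())   # map grid to [0, 1]
    f_R = 0.5 * (1 - np.cos(np.pi * u))       # 0 at left edge, 1 at right
    f_L = 1.0 - f_R
    return f_L, f_R

x = np.linspace(-10.0, 10.0, 256)
f_L, f_R = stereo_weights(x)
print(np.allclose(f_L + f_R, 1.0))  # True
# left/right channels: weight |psi|^2 by f_L and f_R before summing over the grid
```

Weighting the density this way means that probability drifting toward one edge of the simulation domain is heard as the sound panning toward that side.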

Visual rendering displays the evolving probability density and potential $V(x)$ arrays with synchronized frame rates (30–60 Hz), allowing users to correlate quantum dynamics with audio output in real time (Freye et al., 2024).

In the object-removal SAVE (Xu et al., 14 Dec 2025), input consists of video frames $(V_0)$, a synchronized audio waveform $(A_0)$, a user-specified text prompt $(y_\text{text})$, and a segmentation mask $(M)$. With conditions encoded via CLAP (audio–text) and C-RADIOv3 (masked images), the model predicts a flow in the shared latent space, producing $(V_1, A_1)$ with the target object removed in both modalities, while preserving context, content, and tight synchronization.

3. Data Construction and Benchmarking

SAVEBench, the dataset underlying the joint removal system, is built from a synthetic pipeline over 50 h of raw video from AudioSet and KlingFoley. For each 8–10 s clip containing two detected "sounding" objects, dedicated audio tracks are synthesized via MMAudio, filtered for source purity, and mixture/edited pairs $(A^{(i)}, A^{(i)}_{\setminus O_i})$ constructed.

Video editing leverages GroundingDINO (object detection), SAM2 (masking), and Inpaint-Anything (frame inpainting) to form aligned pairs $(V^{(i)}, V^{(i)}_{\setminus O_i})$. The dataset comprises 17,306 examples, each with a paired 8-second video (192 frames at $128 \times 128$), 8-second audio (initially 44.1 kHz, resampled to 32 kHz), a text label, and masks. Object categories span 10 high-level groups (Animals, Vehicles, Instruments, etc.) (Xu et al., 14 Dec 2025).

4. System Architectures and Implementation

The SAVE synthesizer is a VST3/AU plugin implemented in C++ with JUCE, using pocketFFT for rapid spectral transforms, and SIMD-aware loops for efficiency. The engine manages per-voice quantum state and synthesizer buffers, applying time-stepping, wave resampling, envelope shaping, filtering, stereo panning, and live GUI updates in each audio block.

Controls are exposed for initial state (Gaussian wavepacket, superpositions), potential type (harmonic oscillator, barrier, double-well), simulation speed (Δt), grid size, domain zoom, ADSR, filter settings, and real-time parameter mapping.

In the object-removal scenario, the architecture consists of:

  • Audio encoder: pretrained SoundCTM VAE $e_a: A \mapsto X^a \in \mathbb{R}^{T_a \times d_a}$
  • Video encoder: pretrained CV-VAE $e_v: V \mapsto X^v \in \mathbb{R}^{T_v \times d_v}$
  • Temporal projector: $P_\tau$ aligns video latents to the audio frame rate.
  • Latent concatenation: $X_0 = [X_0^a, P_\tau(X_0^v)]$; $X_1$ likewise.
  • Conditional encodings: $\phi_a$ from CLAP, $\phi_v$ from C-RADIOv3 applied to the masked input.
  • SBFM predictor: DiT backbone regresses time-dependent velocities $v_t^\theta$ with separate audio/video heads.
  • Decoders reconstruct the edited audio and video from the model outputs.
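The temporal projection and concatenation steps amount to simple tensor plumbing. The sketch below uses illustrative shapes and a nearest-neighbor resampler as a stand-in for the learned projector $P_\tau$; none of the dimensions are the paper's actual values.

```python
import numpy as np

# Illustrative latent shapes (frames x channels), not the paper's dimensions.
T_a, d_a = 800, 64     # audio latent from the SoundCTM-style VAE
T_v, d_v = 192, 16     # video latent from the CV-VAE

def temporal_project(Xv, T_target):
    """P_tau stand-in: resample video latents onto the audio time grid
    by nearest-neighbor index lookup (the real projector is learned)."""
    idx = np.floor(np.linspace(0, Xv.shape[0] - 1e-6, T_target)).astype(int)
    return Xv[idx]

Xa = np.random.randn(T_a, d_a)
Xv = np.random.randn(T_v, d_v)

# shared latent: audio and time-aligned video concatenated along channels
X0 = np.concatenate([Xa, temporal_project(Xv, T_a)], axis=1)
print(X0.shape)  # (800, 80)
```

Once both modalities live on one time grid, a single DiT backbone can predict a joint velocity field, with per-modality heads splitting the channel dimension back apart.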

Training uses 4×H100 GPUs, the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), batch size $O(64)$, $\lambda = 3$ (audio : video), and an Euler ODE solver with 30 sampling steps per modality; no data augmentation is applied beyond the synthesis/inpainting pipeline.
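Sampling with an Euler ODE solver is a short loop over the learned velocity field. The sketch below mirrors the 30-step integration described above; `velocity_fn` stands in for the trained DiT predictor, and the toy field used in the demo is purely illustrative.

```python
import numpy as np

def euler_sample(x0, velocity_fn, n_steps=30):
    """Integrate dX/dt = v_t(X) from t = 0 to t = 1 with fixed-size
    Euler steps, as in the 30-step solver mentioned in the text."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)   # one explicit Euler step
    return x

# toy velocity field: constant transport from x0 toward a known target
target = np.ones(4)
x0 = np.zeros(4)
x1 = euler_sample(x0, lambda x, t: target - x0)
print(np.allclose(x1, target))  # True
```

Because the bridge drift is close to a straight line between source and target latents, a modest step count suffices; cutting steps further (or distilling the solver away) is the route to the real-time editing discussed later.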

5. Quantitative Performance and Empirical Outcomes

On SAVEBench, SAVE (Xu et al., 14 Dec 2025) demonstrates state-of-the-art removal fidelity and alignment across multiple metrics:

  • Video removal: PSNR = 20.95, SSIM = 0.69, LPIPS = 0.02, FVID = 13.88, outperforming prior baselines (e.g., VACE + ZEUS, VideoPainter + AUDIT).
  • Audio removal: FAD = 0.69, IS = 5.53, KL = 2.10, PSNR = 18.15, SSIM = 0.51, LSD = 1.44.
  • Audiovisual alignment: DeSync (temporal misalignment) = 0.81 s, IBScore (audio–video match) = 0.20, improved over cross-modal cascades and MCFM.
  • Qualitative observations: Complete removal of target object pixels (video) and corresponding audio energy, maintenance of audiovisual synchronization (e.g., lips with speech onset), and non-interference with background content. Failure cases may include faint artifacts for small/occluded objects or spectral coloration in extreme reverberation.

The synthesizer variant (Freye et al., 2024) enables direct audition of quantum phenomena: e.g., tunneling (wave density/sound leaking through barriers), stationary state oscillations (static timbre with quantum “breathing”), and interference (timbral beating from superpositions), providing an interactive, intuitive exploration of quantum systems.

6. Analytical Perspectives and Limitations

The flow-matching mechanism, anchored in Schrödinger Bridge theory, enables joint latent transport for synchronous audiovisual editing. The model avoids adversarial and reconstruction losses, relying solely on velocity regression for efficiency and stability, and its direct modeling of source-to-target transport paths yields faster convergence than methods based on noise-driven diffusion.

Precise audio-visual alignment is attributed to cross-modal latent space coupling, mask and text-based conditioning, and integrated time grid projection. However, effectiveness is constrained by the accuracy of automatic segmentation masks (GroundingDINO/SAM2 pipeline) and the diversity of synthetic data; real-world complex mixtures (e.g., occlusions, modal overlap, reverberant backgrounds) may present challenges. The system is currently removal-only; insertion or other transformations require pipeline extensions.

7. Applications and Prospective Extensions

SAVE’s object-level audiovisual editing capabilities support postproduction in film and broadcast (removal of unwanted actors, noises), interactive video editing (user draws mask and types target object for joint audio/video cleanup), and fine-grained style transfer via cross-modal embeddings. Real-time editing is envisioned by reducing ODE steps or distilling the system into a fast feed-forward network. Reverse (“enrichment”) pipelines could enable object insertion or swap by inverting the learned Schrödinger Bridge.

The synthesizer configuration has implications for both musical practice—novel timbres, modulations tightly coupled to quantum dynamics—and science communication or education, lowering cognitive barriers to quantum mechanics via sensory, interactive immersion (Freye et al., 2024, Xu et al., 14 Dec 2025).
