Schrödinger Audio-Visual Editor (SAVE)
- Schrödinger Audio-Visual Editor (SAVE) is a name shared by two independent technologies: an end-to-end audiovisual removal system and a quantum-inspired synthesizer for creative content manipulation.
- The removal system employs unified cross-modal encoding and a Schrödinger Bridge flow-matching strategy to achieve precise temporal alignment and object-level editing across both audio and video streams.
- The synthesizer is a real-time instrument built on the 1D time-dependent Schrödinger equation, enabling dynamic visualization and audible representations of quantum phenomena.
The Schrödinger Audio-Visual Editor (SAVE) refers to two distinct but independently significant technologies in contemporary computational research: (1) an end-to-end multimodal editor that enables object-level removal in both audio and video streams for content creation, and (2) a quantum-mechanics-inspired synthesizer and visualization plugin that enables real-time auditory and visual exploration of solutions to the one-dimensional time-dependent Schrödinger equation. Both systems share a rigorous mathematical formulation, real-time audiovisual processing, and technical innovation within their target domains (Xu et al., 14 Dec 2025, Freye et al., 2024).
1. Object-Level Audiovisual Editing: Overview and Motivation
SAVE, as described in "Schrödinger Audio-Visual Editor: Object-Level Audiovisual Removal," addresses the challenge of synchronous, object-grounded editing of audio and video for tasks such as removing a specific object and its corresponding sound from a scene. Traditional editing methods, which combine independent audio and video editors, often disrupt spatiotemporal synchronization and semantic correspondence. These limitations motivate the joint modeling of audio and video streams, with a focus on rigorous alignment throughout the edit pipeline (Xu et al., 14 Dec 2025).
The specific task, known as audiovisual removal (AVR), requires paired "before/after" examples with precisely aligned media, posing significant data collection and modeling challenges due to audio-visual modality heterogeneity. SAVE's contribution is to provide synthetic paired datasets and a unified model architecture explicitly designed for parallel object-level editing across both domains.
2. SAVEBench: Synthetic Dataset for Joint Audiovisual Edits
SAVEBench is a synthetic paired dataset central to SAVE's training and evaluation. Its construction involves:
- Source Data Collection: Sampling the first 10 seconds of clips from AudioSet and KlingFoley, aggregating to approximately 585 hours.
- Object Identification: Qwen-VL and GPT-4-mini automatically determine the two most frequent sounding objects per clip.
- Audio Synthesis and Pairing: For each identified object, clean, frame-synchronous tracks are synthesized using MMAudio prompts; track purity is verified with Qwen-Audio and GPT-based distillation, and the tracks are combined into paired mixtures with and without the object.
- Visual Pairing: Object bounding boxes are detected with GroundingDINO, segmentation masks are extracted with SAM2, and object removal is inpainted with Inpaint-Anything, yielding paired "with/without" video frames.
- Annotations: Each of the 17,306 examples includes the name of the removed object and its segmentation mask.
This automated pipeline enables the generation of large-scale, well-aligned before/after samples that are otherwise infeasible with real-world capture (Xu et al., 14 Dec 2025).
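A SAVEBench-style training example can be represented as a simple paired record. The field names below are illustrative only, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AVRemovalExample:
    """One paired before/after sample for audiovisual removal (hypothetical schema)."""
    video_with: str      # path to 10 s clip containing the object
    video_without: str   # same clip with the object inpainted out
    audio_with: str      # audio mixture containing the object's track
    audio_without: str   # audio mixture with the object's track removed
    object_name: str     # name of the removed object, e.g. "seal"
    mask_path: str       # per-frame segmentation mask of the object

# Example record (paths are placeholders)
ex = AVRemovalExample("v1.mp4", "v1_clean.mp4", "a1.wav", "a1_clean.wav",
                      "seal", "m1.npz")
```

Each record carries both "before" and "after" media so the model can be supervised directly on the edit, which is what real-world capture cannot provide.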
3. Model Architecture and Schrödinger Bridge Formulation
SAVE's end-to-end architecture employs cross-modal encoding and flow-matching driven by the Schrödinger Bridge (SB) principle:
- Encoders: Video streams are processed by a CV-VAE encoder, and audio waveforms by a SoundCTM encoder.
- Temporal Alignment: Video latents are temporally projected to match the audio time resolution; the projected video and audio latents are concatenated into the complete source latent.
- SB Flow Matching: Rather than conventional noise-to-data diffusion, SAVE constructs a stochastic process connecting the source and edited ("target") latents through the SB, pinned at both endpoints and parametrized by a DiT backbone with separate audio and video heads.
- Velocity Field: The conditional velocity field is defined in closed form (Equation (1) in the source), and the network is trained to regress this target, yielding a continuous transport ODE between the two mixtures.
The conditioning signals include a CLAP-derived text embedding (e.g., "remove the seal") for audio and a C-RADIOv3 encoding of the object-masked video for visual attention. The flow-matching loss (Equation (2)) includes a tunable weight that balances audio-visual fidelity against synchronization (Xu et al., 14 Dec 2025).
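The data-to-data transport above can be sketched with a generic Brownian-bridge formulation, where `x0` is the source latent and `x1` the edited latent. This is a minimal illustration under assumed notation, not the paper's exact Equation (1):

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_interpolant(x0, x1, t, sigma):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    eps = rng.standard_normal(x0.shape)
    return (1 - t) * x0 + t * x1 + sigma * np.sqrt(t * (1 - t)) * eps

def target_velocity(x1, xt, t):
    """Conditional drift of the pinned bridge toward the edited latent x1."""
    return (x1 - xt) / (1 - t)

# Training regresses a network v_theta(x_t, t, cond) to this target;
# with sigma = 0 the target reduces to the straight-line velocity x1 - x0.
x0, x1, t = np.zeros(4), np.ones(4), 0.3
xt = bridge_interpolant(x0, x1, t, sigma=0.0)
v = target_velocity(x1, xt, t)
```

Because both endpoints are data (source and edited mixtures) rather than Gaussian noise and data, sampling follows a short transport ODE between the two, which is what distinguishes the SB formulation from conventional diffusion.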
4. Quantitative Evaluation, Baselines, and Qualitative Examples
SAVE is evaluated through a comprehensive suite of quantitative metrics:
- Visual Fidelity: PSNR, SSIM, LPIPS, FVID.
- Audio Fidelity: Log-Spectral Distance (LSD), Fréchet Audio Distance (FAD), Inception Score (IS), KL divergence.
- Audiovisual Alignment: DeSyncScore (temporal drift) via Synchformer, IBScore (cosine similarity of ImageBind AV embeddings).
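Of these metrics, the IBScore reduces to a cosine similarity between ImageBind audio and video embeddings. A minimal sketch of that computation, assuming precomputed embedding vectors:

```python
import numpy as np

def ib_score(audio_emb, video_emb):
    """Cosine similarity between audio and video embeddings (IBScore-style)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = video_emb / np.linalg.norm(video_emb)
    return float(np.dot(a, v))
```

Higher values indicate that the edited audio and video still describe the same scene; extracting the embeddings themselves requires the ImageBind model and is omitted here.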
Baselines include all cross-modal combinations of {ZEUS, AUDIT} audio editors and {VACE, LGVI, VideoPainter} video inpainting models, as well as a multimodal Gaussian noise + classical flow matching (CFM) alternative.
Tables 1–3 of the original paper report that SAVE substantially outperforms all baselines:
| Metric | SAVE (best) | Baselines (best) |
|---|---|---|
| Visual PSNR (higher is better) | 20.95 dB | 13 dB |
| Audio FAD (lower is better) | 0.69 | higher |
| AV DeSyncScore (lower is better) | 0.81 s | higher |
Ablation studies confirm the criticality of modality-specific DiT heads, CLAP conditioning, and careful tuning of the loss-balancing weight. Qualitative visualizations (e.g., toy set, motorcycle, fire truck) demonstrate artifact-free removal of both visual and sonic components, in stark contrast to conventional pipelines (Xu et al., 14 Dec 2025).
5. Workflow, Training, and Deployment Considerations
SAVE is trained on 44.1 kHz audio and 128×128 video at 8 fps, using a DiT backbone (16 layers, hidden dimension 2048, 16 attention heads; audio and video heads with 4 layers and 4 heads each), the AdamW optimizer, and batch sizes set by available GPU memory (e.g., 4×H100 GPUs for 42 epochs in roughly 12 hours).
In production, SAVE can be exposed as an interface requiring users only to mask the target object and provide a text cue (e.g., “remove the seal”), producing results in a few seconds. All VAEs are fixed at inference, and the framework can be generalized to other paired AV edit tasks such as inpainting or style transfer given appropriate data (Xu et al., 14 Dec 2025).
6. Quantum Audio-Visual Synthesis and Rendering
A parallel research thread, "Creating a Synthesizer from Schrödinger's Equation," employs the acronym SAVE for a real-time synthesizer and visualization tool that numerically integrates the 1D time-dependent Schrödinger equation

iħ ∂ψ(x,t)/∂t = [ −(ħ²/2m) ∂²/∂x² + V(x,t) ] ψ(x,t).

Implemented in natural units, SAVE discretizes the spatial domain and applies the second-order split-step (Strang) method per timestep, alternating between position and momentum space using an optimized FFT. The simulated probability density |ψ(x,t)|² is read as a wavetable for audio synthesis, mapped to DAW pitch via phase accumulation and interpolation.
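One Strang step of such an integrator can be sketched with NumPy FFTs. This assumes natural units (ħ = 1) and periodic boundaries; the plugin's actual FFT backend and boundary handling may differ:

```python
import numpy as np

def strang_step(psi, V, dx, dt, m=1.0):
    """Advance psi by dt: half potential kick, full kinetic step in k-space, half kick."""
    k = 2 * np.pi * np.fft.fftfreq(psi.size, d=dx)   # angular wavenumber grid
    half_kick = np.exp(-0.5j * V * dt)               # exp(-i V dt / 2)
    drift = np.exp(-0.5j * (k**2 / m) * dt)          # exp(-i k^2 dt / 2m)
    psi = half_kick * psi
    psi = np.fft.ifft(drift * np.fft.fft(psi))       # kinetic step in momentum space
    return half_kick * psi
```

Because every factor has unit modulus and the FFT pair is unitary, the step conserves the norm Σ|ψ|²·dx to machine precision, which is what makes the evolution stable at audio rates.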
Stereo output is achieved via parametric channel-weight functions, and normalization ensures zero DC offset. The graphical UI renders the probability density |ψ|², the potential V(x), and optionally the wavefunction's phase in real time at 60 fps, using lock-free FIFO communication between the audio and GUI threads. Core modules (SimulationEngine, FFTModule, Voice, AudioProcessor, GUIEditor) are orchestrated for deterministic playback in plugin or standalone host modes.
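Reading the simulated density as a wavetable via phase accumulation and linear interpolation can be sketched as follows. This is a generic wavetable oscillator, not the plugin's actual Voice implementation:

```python
import numpy as np

def render_voice(table, freq, sample_rate, n_samples, phase=0.0):
    """Phase-accumulation wavetable oscillator with linear interpolation."""
    L = len(table)
    inc = freq * L / sample_rate              # table positions advanced per sample
    out = np.empty(n_samples)
    for i in range(n_samples):
        i0 = int(phase) % L
        frac = phase - int(phase)             # fractional position for interpolation
        out[i] = (1 - frac) * table[i0] + frac * table[(i0 + 1) % L]
        phase = (phase + inc) % L
    return out, phase                         # returned phase keeps playback continuous
```

Returning the accumulated phase lets the caller render successive audio buffers without clicks; as the simulation updates the table each frame, the oscillator's timbre evolves with the quantum state.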
Audible quantum effects such as tunneling (leakage through a barrier) and superposition (beatings from double-Gaussian initializations) manifest as evolving timbral shifts and modulations, thereby providing both musical utility and insight into quantum dynamics. All controls, including mass, potential shape, initial wavefunction, boundary conditions, step size, and sample resolution, are exposed through a standardized, MIDI-automatable UI (Freye et al., 2024).
7. Current Limitations and Opportunities
For the object-level editing model, SAVE's current limitations include the synthetic focus of SAVEBench (real-object removal, insertions, and multi-object scenes are not represented) and compute-bound ODE sampling at inference (2–3 s per clip). The current interface also requires a user-supplied mask; integrating automatic mask generation and text-only prompting remains open. Extensions could include longer-clip handling, object-level style transfer, and compositional editing (Xu et al., 14 Dec 2025).
For the quantum synthesizer, while real-time simulation at musical rates already runs on commodity hardware, further work could extend the simulation to higher dimensions, try alternative discretizations, or explore more exotic mappings from quantum states to audio and visual modalities. Both lines of work sit at the intersection of modality engineering, rigorous mathematical modeling, and practical system design within contemporary audio-visual computation (Freye et al., 2024).