Audeo: An Innovative Approach to Audio Generation from Silent Piano Videos
The paper "Audeo: Audio Generation for a Silent Performance Video" presents an intriguing system designed to generate music audio from video recordings of piano performances, capturing the musician's actions without sound. The authors built a comprehensive pipeline named 'Audeo', composed of three core components that facilitate this transformation. The paper explores the complexities involved in associating visual events with musical sounds, exploring whether such transformation is indeed plausible as well as the potential methods for achieving it.
Summary of Methodology and Components
The central challenge addressed in this research is converting visual data into audio without any reliance on a sound input. The Audeo pipeline performs this conversion in three stages:
- Video2Roll Net: The first component transforms video frames of the piano keyboard and the pianist's hand movements into a symbolic representation known as a Piano-Roll, which records which keys are active at each frame. This multi-scale feature attention network captures spatial dependencies among keys and hands while coping with complications such as hand occlusion and the need for temporal precision (a simplified classifier in this spirit is sketched after the list).
- Roll2Midi Net: The second stage refines the Piano-Roll output using a Generative Adversarial Network (GAN), correcting the raw predictions while capturing the temporal correlations and musical attributes needed for realistic music. The resulting symbolic signal, termed Pseudo-Midi, forms the basis for audio synthesis (see the GAN sketch below).
- Midi Synthesization: Finally, the Pseudo-Midi is converted to audio using Midi synthesizers. The authors employed both a classical synthesizer and a deep learning-based one, PerfNet, transforming the Midi representation into a realistic audio waveform (a classical-synthesis sketch appears below).
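The sketch below illustrates the spirit of the first stage: a multi-label classifier mapping a short stack of keyboard frames to 88 per-key logits. It is a minimal stand-in rather than the paper's architecture; the real Video2Roll Net uses multi-scale feature attention, and the names and shapes here (Video2RollSketch, the 5-frame window, the 100x900 crop size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Video2RollSketch(nn.Module):
    """Simplified frame-to-Piano-Roll classifier (not the paper's network).

    A plain CNN maps a short stack of grayscale keyboard frames, treated
    as input channels, to 88 per-key logits for the middle frame.
    """

    def __init__(self, in_frames=5, num_keys=88):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling over the keyboard image
        )
        self.classifier = nn.Linear(64, num_keys)

    def forward(self, frames):
        # frames: (batch, in_frames, height, width)
        h = self.features(frames).flatten(1)
        return self.classifier(h)  # logits; sigmoid > 0.5 marks pressed keys

# Each key is an independent binary decision, so training is multi-label.
model = Video2RollSketch()
frames = torch.randn(4, 5, 100, 900)            # hypothetical keyboard crops
targets = torch.randint(0, 2, (4, 88)).float()  # per-frame key activations
loss = nn.BCEWithLogitsLoss()(model(frames), targets)
```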
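The second stage can be pictured as a GAN over Piano-Roll segments. The sketch below is schematic, not the paper's Roll2Midi architecture or training scheme: a generic generator/discriminator pair over an 88-key by 64-step segment (both dimensions assumed) showing the adversarial refinement idea.

```python
import torch
import torch.nn as nn

KEYS, T = 88, 64  # assumed segment size: 88 keys x 64 time steps

class Roll2MidiGeneratorSketch(nn.Module):
    """Refines a noisy Piano-Roll segment toward a cleaner Pseudo-Midi."""
    def __init__(self):
        super().__init__()
        # 1-D convolutions along time let each key's activation borrow
        # context from neighboring time steps.
        self.net = nn.Sequential(
            nn.Conv1d(KEYS, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, KEYS, kernel_size=5, padding=2), nn.Sigmoid(),
        )
    def forward(self, roll):  # roll: (batch, KEYS, T)
        return self.net(roll)

class MidiDiscriminatorSketch(nn.Module):
    """Scores whether a segment looks like real Midi-derived music."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(KEYS, 64, kernel_size=5, stride=2), nn.LeakyReLU(0.2),
            nn.Flatten(), nn.LazyLinear(1),  # outputs one realism logit
        )
    def forward(self, seg):
        return self.net(seg)

G, D = Roll2MidiGeneratorSketch(), MidiDiscriminatorSketch()
noisy_roll = torch.rand(8, KEYS, T)  # stand-in for Video2Roll probabilities
real_midi = torch.rand(8, KEYS, T)   # stand-in for ground-truth Midi segments
bce = nn.BCEWithLogitsLoss()

# Standard adversarial losses: D separates real from refined segments,
# G tries to make its refinements pass as real.
d_loss = bce(D(real_midi), torch.ones(8, 1)) + \
         bce(D(G(noisy_roll).detach()), torch.zeros(8, 1))
g_loss = bce(D(G(noisy_roll)), torch.ones(8, 1))
```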
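For the final stage, a classical synthesis path can be sketched with the pretty_midi library: consecutive active frames of each key are merged into notes, then rendered through FluidSynth with a SoundFont or via pretty_midi's built-in sine synthesis as a fallback. The function name, the 25 fps frame rate, and the fixed velocity of 90 are illustrative assumptions, not details from the paper; PerfNet, the deep synthesizer, is not reproduced here.

```python
import numpy as np
import pretty_midi

def pseudo_midi_to_audio(roll, fps=25, sf2_path=None):
    """Render a binary Pseudo-Midi array (88 keys x frames) to audio."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for key in range(roll.shape[0]):
        pitch = key + 21  # piano key 0 corresponds to MIDI note 21 (A0)
        active = np.flatnonzero(roll[key])
        if active.size == 0:
            continue
        # Split the active frame indices into runs of consecutive frames;
        # each run becomes one sustained note.
        runs = np.split(active, np.where(np.diff(active) > 1)[0] + 1)
        for run in runs:
            piano.notes.append(pretty_midi.Note(
                velocity=90,                 # assumed fixed dynamics
                pitch=pitch,
                start=run[0] / fps,
                end=(run[-1] + 1) / fps,
            ))
    pm.instruments.append(piano)
    if sf2_path:                   # FluidSynth rendering needs a SoundFont
        return pm.fluidsynth(sf2_path=sf2_path)
    return pm.synthesize()         # simple sinusoid-based fallback

# Usage with a random sparse roll standing in for a Pseudo-Midi prediction.
audio = pseudo_midi_to_audio(np.random.rand(88, 250) > 0.97)
```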
Results and Implications
The Audeo system generates audio of reasonable fidelity that popular music identification software can recognize accurately, indicating robustness and generality. The system was tested on 'in the wild' piano performances, achieving high recognition precision. While the generated audio does not entirely eliminate the discrepancy between visual cues and the true audio, Audeo's approach significantly narrows this gap by combining sophisticated visual recognition with signal processing techniques.
Future Prospects in AI and Audio-Visual Transformation
The implications of this research extend into several domains. Practically, the system can be integrated into tools for music transcription and virtual learning environments, helping musicians capture and reproduce their performances accurately. Theoretically, this deep learning-based approach opens avenues for multimedia processing in which AI models bridge the perceptual gap between visual and audio data, potentially extending to more complex instruments and acoustic settings.
Further research could expand Audeo's capabilities to a wider range of instruments and genres. Domain adaptation or fine-tuning of the GAN and synthesizers could improve the audio rendition, yielding more dynamic and context-sensitive sound generation.
In conclusion, Audeo exemplifies a meticulous approach to merging the visual and auditory realms through AI, demonstrating the potential of computational systems to interpret and recreate artistic expression from silent video data. Despite the inherent challenges, the methods showcased by the authors offer promising steps forward in audio-visual AI translation.