Audeo System: Advanced Audio Processing
- Audeo System is a suite of advanced real-time audio processing platforms spanning video-to-audio synthesis, body balance biofeedback, and adaptive embedded noise suppression.
- Each variant employs specialized architectures: deep CNNs and GANs for video synthesis, kinematic modeling for biofeedback, and LMS adaptive filtering on ESP32 for embedded enhancement.
- Evaluations report music recognition rates of up to 85.6% for video synthesis, significant sway reduction in biofeedback trials, and effective noise suppression in embedded applications.
The term "Audeo System" designates several advanced real-time audio processing platforms spanning diverse application domains: silent video-to-audio synthesis, body balance biofeedback, and adaptive noise suppression systems. Presented across multiple independent research efforts, each system employs unique architectures and algorithmic methodologies for extracting, generating, or enhancing audio signals, leveraging sensor arrays, embedded computation, and deep learning as appropriate. The following systematically examines the principal Audeo System research threads as documented in the literature.
1. Video-to-Audio Synthesis: The "Audeo" Pipeline
The system introduced by Su et al. under the name "Audeo" addresses the challenging task of generating realistic audio (music) from silent video of a piano performance (Su et al., 2020). The pipeline comprises three sequential stages:
- Video-to-Piano-Roll Translation (Video2Roll Net): A convolutional neural network receives stacks of five consecutive grayscale, cropped keyboard frames and outputs per-frame, per-key probabilities for keypress events, producing a raw piano-roll sequence indexed by key and time frame.
- Temporal Correlation Adaptation (Roll2Midi Net): To correct temporal inconsistencies and model sustain, a U-Net-based generator within a GAN framework smooths the piano-roll sequence over overlapping 100-frame blocks. This produces a temporally coherent pseudo-MIDI representation, which is thresholded to yield a binary piano roll.
- MIDI Synthesis: Two synthesis routes are provided: (A) Classical, using FluidSynth to convert the binary pseudo-MIDI into WAV; and (B) Deep, where a performance-timbre neural network (PerfNet) translates the piano roll to a magnitude spectrogram, which is refined by a U-Net and converted to time-domain audio by Griffin-Lim. Both methods operate at 16 kHz.
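The frame-stacking input to Video2Roll and the thresholding of the Roll2Midi output can be sketched as follows; the array shapes, the 0.5 threshold, and the random stand-ins for network outputs are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def stack_frames(frames, depth=5):
    """Build sliding stacks of `depth` consecutive grayscale frames.

    frames: array of shape (T, H, W); returns (T - depth + 1, depth, H, W),
    one stack per prediction step, as input to the Video2Roll network.
    """
    T = frames.shape[0]
    return np.stack([frames[t:t + depth] for t in range(T - depth + 1)])

def roll_to_binary(pseudo_midi_probs, threshold=0.5):
    """Threshold per-frame, per-key probabilities (Roll2Midi output)
    into a binary piano roll. The 0.5 threshold is illustrative."""
    return (pseudo_midi_probs >= threshold).astype(np.uint8)

# Illustrative shapes: 100 frames, 64x256 cropped keyboard, 88 keys.
frames = np.random.rand(100, 64, 256).astype(np.float32)
stacks = stack_frames(frames)       # (96, 5, 64, 256) inputs to Video2Roll
probs = np.random.rand(96, 88)      # stand-in for per-key network outputs
roll = roll_to_binary(probs)        # binary piano roll, shape (96, 88)
```

The binary roll is what either synthesis route consumes: FluidSynth renders it directly, while PerfNet maps it to a spectrogram first.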
The system is trained and evaluated on videos aligned with audio-derived ground truth. The full pipeline achieves frame-level F1 scores exceeding the previous state of the art, with recognition rates (via SoundHound) for the generated music reaching 85.6% on seen pieces and 72.4% on unseen pieces. FluidSynth-based audio exhibits higher spectral fidelity but less expressive velocity variance than PerfNet-based synthesis (Su et al., 2020).
2. Body Balance Audio-Biofeedback: IMU-Driven Audeo System
Costantini et al. developed a wireless audio-biofeedback "Audeo System" for postural balance enhancement in human subjects (Costantini et al., 2019). Its architecture comprises:
- Hardware:
- Belt-worn, back-mounted IMU ("Movit", 4.77 × 4.12 × 1.76 cm, mass 20 g) positioned between vertebrae L2 and L5, sampling 3-axis acceleration (±2 g to ±16 g), gyroscope, magnetometer, and barometer at 50 Hz; data are streamed wirelessly to a PC for analysis.
- PC (Intel i5, 4 GB RAM) running Captiks Motion Studio SDK for IMU data acquisition and Max/MSP for real-time audio feedback generation.
- Open-type headphones (AKG K701, 10 Hz–39.8 kHz, 105 dB SPL/V) providing stereo biofeedback.
- Software Pipeline:
- Real-time computation of trunk anteroposterior (AP, θ_AP) and mediolateral (ML, θ_ML) tilt angles per sample from the accelerometer data; the pair (θ_AP, θ_ML) serves as a virtual point projected onto the floor plane.
- Sway magnitude: Euclidean distance from the upright origin, s = √(θ_AP² + θ_ML²).
- Metrics: sway range and variance computed per trial; the percent improvement with ABF on is computed as P = 100 · (M_off − M_on) / M_off for a metric M, yielding PR for range and Pv for variance.
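A minimal sketch of these per-sample computations and the improvement metric, assuming a simple static-tilt estimate from the accelerometer (the paper's exact angle estimator is not reproduced here):

```python
import numpy as np

def tilt_angles(acc):
    """Approximate trunk tilt from one 3-axis accelerometer sample
    (ax, ay, az) in g units; a simple static-tilt model, assumed here
    rather than taken from the paper."""
    ax, ay, az = acc
    theta_ap = np.degrees(np.arctan2(ax, az))  # anteroposterior tilt
    theta_ml = np.degrees(np.arctan2(ay, az))  # mediolateral tilt
    return theta_ap, theta_ml

def sway_magnitude(theta_ap, theta_ml):
    """Euclidean distance of the (AP, ML) tilt point from the upright origin."""
    return np.hypot(theta_ap, theta_ml)

def percent_reduction(metric_off, metric_on):
    """Percent improvement of a sway metric with biofeedback on vs. off."""
    return 100.0 * (metric_off - metric_on) / metric_off

# Example: sway range shrinks from 4.0 deg (ABF off) to 2.5 deg (ABF on).
pr = percent_reduction(4.0, 2.5)  # 37.5% reduction
```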
- Audio-Biofeedback Mapping:
- The tilt plane is partitioned into six regions (A–F), from "safe" to "critical." Increasing trunk sway triggers increasingly urgent pink or filtered noise, with critical excursions eliciting panned, pulsed narrowband warnings that encode direction in the stereo field.
- All mapping parameters (bandwidth, gain, gating period) are mathematically specified for spatial and urgency encoding.
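An illustrative version of the region mapping; the degree thresholds, gains, and signal names below are placeholder assumptions, since the paper's calibrated parameters are not reproduced here:

```python
# Illustrative region mapping: thresholds (degrees) and feedback
# parameters are placeholders, not the paper's calibrated values.
REGIONS = [
    ("A", 1.0, {"signal": "silence"}),
    ("B", 2.0, {"signal": "pink_noise", "gain_db": -24}),
    ("C", 3.0, {"signal": "pink_noise", "gain_db": -12}),
    ("D", 4.0, {"signal": "filtered_noise", "gain_db": -6}),
    ("E", 5.0, {"signal": "narrowband_pulse", "gain_db": 0}),
    ("F", float("inf"), {"signal": "narrowband_pulse", "gain_db": 6}),
]

def feedback_for(sway_deg, theta_ml):
    """Map sway magnitude to a region A-F, panning the cue toward the
    side of the mediolateral excursion (-1 = left, +1 = right)."""
    for name, upper, params in REGIONS:
        if sway_deg < upper:
            pan = max(-1.0, min(1.0, theta_ml / 5.0))
            return name, dict(params, pan=pan)
    # unreachable: region F's upper bound is infinite
```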
- Evaluation:
- n=10 subjects (4 young, 6 old) underwent standing balance tests (solid/foam surface, eyes open/closed, ABF off/on).
- Median sway reduction (PR) ranged from 10.65% to 65.90%, with variance reductions (Pv) of 26.19%–65.90%.
- Both age groups benefitted, especially on solid surface with eyes open (young: PR=56.0%, Pv=65.6%; old: PR=34.4%, Pv=58.1%).
- Consistent ABF benefit seen in every subject and condition.
- Future directions include individually adaptive thresholds, improved tilt estimation (sensor fusion), multimodal biofeedback, and long-term/transfer testing (Costantini et al., 2019).
3. Adaptive Embedded Audio Enhancement: ESP32-Based Audeo System
The work "Filtro Adaptativo y Módulo de Grabación en Dispositivo Para Mejora en la Calidad de Audición" ("Adaptive Filter and On-Device Recording Module for Improved Hearing Quality") presents an embedded auditory enhancement system, also referenced as "Audeo System", that employs real-time LMS adaptive filtering on an ESP32 microcontroller (Torres et al., 24 Feb 2025). Key elements include:
- Hardware Block:
- ESP32 (ESP-WROOM-32) hosting the I²S audio bus, connected to an INMP441 MEMS microphone, a MAX98357A Class-D amplifier/DAC for speaker output, and an optional ISD1820 voice-recorder module.
- PC interface via Arduino IDE for programming and real-time data visualization.
- Processing Chain:
- Digitized audio (24-bit, 48 kHz) flows from the INMP441 to the ESP32 over I²S.
- Pre-processing: normalization and band-limiting to the audible range (20 Hz–20 kHz).
- LMS adaptive filter (order N = 64): y(n) = wᵀ(n)·x(n), e(n) = d(n) − y(n), w(n+1) = w(n) + μ·e(n)·x(n), where x(n) is the vector of the N most recent input samples and d(n) the desired signal. The step size μ is set empirically, with 0 < μ < 2/(N·Pₓ) (Pₓ being the input signal power) recommended for stability.
- Post-processing: convert to 16-bit PCM, playback through MAX98357A and speaker/headphones.
- Software & Real-Time:
- Implemented in C/C++ (Arduino IDE), leveraging Espressif’s I²S drivers; designed to meet 48 kHz sampling with 32×512-sample DMA buffers and minimal computational and memory load (approx. 9.6 MFLOP/s).
- Performance:
- Subjective listening tests showed "complete noise suppression" and improved SNR; formal SNR and intelligibility metrics were not quantified.
- Latency is described as "minimal."
- Main limitations are RAM-dependent filter order (N=64), and no objective performance metrics due to lack of calibrated instruments.
- AI/ML Role:
- The implementation features dynamic adjustment of the step size μ for adaptive convergence but does not employ explicit machine-learning noise classifiers or neural networks.
- The use of "técnicas de inteligencia artificial" (artificial-intelligence techniques) is limited to step-size adaptation; future extensions are proposed to employ ML for environment classification and parameter tuning (Torres et al., 24 Feb 2025).
4. Algorithmic and Architectural Distinctives
| Application Domain | Sensing/Acquisition | Main Algorithmic Paradigm | Output/Feedback Mode |
|---|---|---|---|
| Video-to-audio synthesis | Video (keyboard, hand tracking) | Deep CNN (ResNet, U-Net GAN), MIDI synth | Audio (music waveform) |
| Postural biofeedback | IMU (3-axis accel., gyro, mag., baro.) | Kinematic modeling, region-mapped sonification | Stereo real-time biofeedback |
| Embedded hearing enhancement | MEMS mic, ESP32 | LMS adaptive filtering | Speaker, optional recording |
The only system featuring deep learning and adversarial training is the video-to-audio Audeo pipeline (Su et al., 2020). The balance-improving Audeo System leverages real-time sensor-kinematics and empirically partitioned signal regions for graded audio alerting (Costantini et al., 2019). The ESP32-based Audeo System focuses strictly on adaptive signal enhancement using classical adaptive filtering and manual parameter tuning (Torres et al., 24 Feb 2025).
5. Practical Constraints, Evaluation, and Limitations
Each system is shaped by its target domain's constraints and evaluation regime:
- The video-to-audio Audeo system requires a fixed camera, visible full keyboard, overhead angle, and benefits from controlled lighting; real-time operation is feasible only for FluidSynth-based synthesis on CPUs, while deep synthesis is approximately real-time on modern GPUs.
- Balance biofeedback trials are limited by subject familiarity, a small sample size (n = 10), and absence of long-term retention benchmarks; benefit is demonstrated across age and condition, but transfer/generalization is unmeasured.
- ESP32 Audeo enhancement is limited by low-order filtering and unreported objective metrics; qualitative user reports and visual waveform inspection constitute the principal evaluation methods.
In all cases, extension to broader populations, more complex environments, or cross-modal/multimodal scenarios is noted as an open line of research.
6. Related Work and Comparative Context
The Audeo Systems share commonalities with broader trends in embedded audition, artificial sensory augmentation, and neural audio synthesis, but are distinct in the following respects:
- The video-to-audio Audeo system's pipeline differs from prior video-to-audio models by coupling frame-level vision-to-symbolic mapping with explicit temporal adaptation and GAN-based sequence refinement, and by offering both classical and neural waveform synthesis. Performance benchmarks demonstrate improvement over single-stage convolutional models (Su et al., 2020).
- The balance biofeedback Audeo System is situated within a lineage of somatosensory substitution devices; its use of empirically optimized, spatialized audio cues for postural region encoding is a distinguishing design (Costantini et al., 2019).
- The ESP32-based system's approach is representative of low-cost, embedded speech enhancement and does not yet implement more advanced multi-microphone or machine-learning-based noise suppression, though such extensions are discussed (Torres et al., 24 Feb 2025).
7. Future Prospects and Research Directions
Outstanding research opportunities identified across Audeo System variants include:
- For video-to-audio synthesis, improving robustness beyond controlled settings (e.g., camera angle invariance, occlusions), reducing artifacts in deep synthesis, and extending to non-piano instruments.
- For body balance biofeedback, implementing individualized adaptive mapping, sensor fusion (e.g., Kalman/complementary filtering of inertial data), multi-modal cues, and longitudinal efficacy studies incorporating transfer and retention.
- In embedded auditory enhancement, scaling up filter complexity with more capable MCUs, integrating lightweight neural noise classification methods, and implementing objective metrics logging for systematic benchmarking.
Continued refinement, empirical evaluation, and open-source dissemination constitute major priorities for advancing practical impact in clinical, educational, and consumer domains.