DOVSinger: Unified Ensemble Voice Synthesis
- The DOVSinger System is a deep neural network architecture that models inter-singer interactions to synthesize a unified vocal ensemble.
- It uses multi-part musical scores with cross-part attention and specialized loss functions to jointly predict acoustic features and reduce pitch variance.
- Experimental results show improved perceptual unity and reduced pitch scatter compared to independent solo synthesis, highlighting its practical effectiveness.
The DOVSinger System refers to a deep neural network (DNN)-based ensemble singing voice synthesis (SVS) architecture that explicitly models interactions between singers to achieve a unified ensemble voice. Unlike conventional SVS systems that are designed primarily for solo voice synthesis and do not incorporate inter-singer adjustments, DOVSinger introduces cross-part interaction mechanisms and specialized loss functions, utilizing multi-part musical scores as input. The objective is to improve the perceptual unity and coherence of synthesized vocal ensembles by simulating how singers modify their vocal output in response to one another.
1. Problem Statement and Motivation
Traditional singing voice synthesis models largely focus on solo performances, treating ensemble synthesis as the mere aggregation of independently generated solo tracks. This neglects the phenomenon in natural ensembles whereby individual singers adjust pitch, timing, timbre, and expression to achieve unity with others. This omission typically degrades overall ensemble quality, producing a noticeable lack of synchrony and blend in synthetic choruses. DOVSinger addresses this deficit by modeling inter-singer interaction effects directly in the synthesis pipeline, aiming to produce ensembles whose voices exhibit mutually dependent adjustment behaviors.
2. System Architecture
The DOVSinger architecture processes musical scores for each ensemble part and outputs acoustic features that are conditioned on both individual and collective context:
- Score Encoder: Processes musical scores (pitch, rhythm, text, expression) for each voice part, enabling encoding of intended musical structure.
- Singer Encoder: Models the timbral identity and individual style of each singer, typically via embeddings.
- Acoustic Model: Predicts framewise acoustic features such as mel-spectrograms, fundamental frequency ($F_0$), energy, and aperiodicity.
- Interaction Modeling Module: A critical novelty, this module conditions each voice's acoustic generation not only on its own score but also on the predicted or planned acoustic outputs of the other ensemble members. Architecturally, this may be realized by cross-part attention or autoregressive dependencies, enabling contextual adjustment and mutual influence (a sketch follows this list).
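Since the architectural realization is left open, the following minimal PyTorch sketch shows one plausible form of cross-part attention, in which each part's per-frame hidden state attends to the same frame of all other parts. The module name `CrossPartAttention`, the tensor shapes, and the residual-plus-norm layout are illustrative assumptions, not the published DOVSinger implementation.

```python
import torch
import torch.nn as nn

class CrossPartAttention(nn.Module):
    """Illustrative sketch: couples per-part encoder states via attention
    over the part axis. Not the published DOVSinger implementation."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_parts, frames, d_model) per-part encoder outputs.
        b, p, t, d = h.shape
        # Fold time into the batch so attention runs across the part axis:
        # each frame of each part attends to the same frame of all parts.
        x = h.permute(0, 2, 1, 3).reshape(b * t, p, d)
        ctx, _ = self.attn(x, x, x)            # (b*t, n_parts, d_model)
        x = self.norm(x + ctx)                 # residual + layer norm
        return x.reshape(b, t, p, d).permute(0, 2, 1, 3)

# Usage: batch of 2, four-part ensemble, 200 frames, 256-dim states.
h = torch.randn(2, 4, 200, 256)
out = CrossPartAttention(256)(h)               # same shape, parts now coupled
```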
3. Input Representation: Multi-Part Musical Scores
DOVSinger employs input in the form of multi-part musical scores, encoded as a set $S = \{s_1, s_2, \ldots, s_N\}$, where $s_n$ represents the score for the $n$-th singer. This structure allows explicit modeling of inter-part reference and interaction, facilitating computational mechanisms whereby each part can adapt its output in relation to others.
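For concreteness, a minimal container for such a score set might look as follows; the field names (`pitches`, `durations`, `lyrics`) are assumptions for illustration, not DOVSinger's actual input schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PartScore:
    """One voice part's score; field names are illustrative assumptions."""
    pitches: List[int]      # MIDI note numbers
    durations: List[float]  # note durations in seconds
    lyrics: List[str]       # one syllable per note

# Multi-part input S = {s_1, ..., s_N}: one PartScore per singer.
score_set: List[PartScore] = [
    PartScore([67, 69], [0.5, 0.5], ["la", "la"]),  # soprano
    PartScore([60, 62], [0.5, 0.5], ["la", "la"]),  # alto
]
```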
4. Modeling Singer Interactions
Key to the DOVSinger methodology is explicit simulation of interaction among singers. Mechanisms include:
- Cross-Conditioning: Each singer's synthesis process receives, in addition to their own score, information derived from the outputs or features of other singers. For part $n$, the model can access predicted $F_0$, timbre, or spectral features from parts $m \neq n$, supporting mutual adaptation (see the sketch after this list).
- Self-Attention Across Singers: Attention mechanisms spanning all singers enable the ensemble synthesis model to capture both local and global dependencies across voice parts, reflecting real-world ensemble coordination such as unified intonation and expressive synchronization.
- Micro-Intonation and Vibrato Interaction Modeling: The system can incorporate fine-grained pitch deviations and expressive features that arise from singers matching or intentionally diverging in ensemble settings.
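As a concrete sketch of cross-conditioning, part $n$'s decoder input can be augmented with a framewise summary of the other parts' predicted $F_0$ contours. Averaging over the other parts, as below, is an illustrative choice, not necessarily the published design.

```python
import torch

def cross_condition_f0(f0: torch.Tensor) -> torch.Tensor:
    """Sketch of cross-conditioning: for each part n, summarize the
    predicted F0 of the other parts m != n by their framewise mean.
    Assumes at least two parts; averaging is an illustrative choice.

    f0: (n_parts, frames) predicted F0 per part, in Hz.
    Returns (n_parts, frames): the context each part is conditioned on.
    """
    n = f0.shape[0]
    total = f0.sum(dim=0, keepdim=True)        # (1, frames)
    others_mean = (total - f0) / (n - 1)       # exclude part n itself
    return others_mean

# Part n's acoustic decoder then receives [own features; others_mean[n]].
```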
5. Loss Function Design
Loss functions in DOVSinger extend beyond standard framewise acoustic matching to incorporate ensemble-level dependencies:
- Individual Loss ($\mathcal{L}_{\text{ind}}$): Measures the fidelity of each singer's synthesized features against ground truth (e.g., MSE for the spectrogram, $F_0$, etc.).
- Interaction Loss ($\mathcal{L}_{\text{int}}$): Penalizes excessive divergence and enforces desired alignment or blending across ensemble members. Formulations include:
  - Pitch Variance Penalty: $\mathcal{L}_{\text{pitch}} = \frac{1}{T} \sum_{t=1}^{T} \operatorname{Var}_n\big(F_0^{(n)}(t)\big)$, promoting small pitch variance for unison passages.
  - Spectral Blending: $\mathcal{L}_{\text{spec}} = \sum_{n \neq m} \big\| M^{(n)} - M^{(m)} \big\|^2$, where $M^{(n)}$ is the mel-spectrogram of part $n$, enforcing timbral similarity that enhances perceptual unity.

The total loss is a weighted sum $\mathcal{L} = \mathcal{L}_{\text{ind}} + \lambda \mathcal{L}_{\text{int}}$, where $\lambda$ modulates interaction strength.
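A compact sketch of these losses, assuming PyTorch tensors for the $F_0$ tracks and mel-spectrograms; the exact formulations and weighting used by DOVSinger may differ.

```python
import torch

def pitch_variance_penalty(f0: torch.Tensor) -> torch.Tensor:
    """L_pitch: framewise variance of F0 across parts, averaged over time.
    f0: (n_parts, frames), e.g. F0 in cents or log-F0 on unison passages."""
    return f0.var(dim=0, unbiased=False).mean()

def spectral_blend_loss(mel: torch.Tensor) -> torch.Tensor:
    """L_spec: mean squared distance between every pair of parts' mels.
    mel: (n_parts, frames, n_mels)."""
    n = mel.shape[0]
    loss = mel.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + ((mel[i] - mel[j]) ** 2).mean()
    return loss / (n * (n - 1) / 2)

def total_loss(l_ind: torch.Tensor, f0: torch.Tensor, mel: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    """L = L_ind + lambda * L_int, with L_int the sum of the two terms.
    The weight lam is a tunable hyperparameter (value here is arbitrary)."""
    l_int = pitch_variance_penalty(f0) + spectral_blend_loss(mel)
    return l_ind + lam * l_int
```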
6. Experimental Protocols, Datasets, and Evaluation Metrics
Experiments with DOVSinger are conducted using multitrack choral datasets such as Dagstuhl ChoirSet and jaCappella, enabling access to genuine ensemble recordings for supervised training. Evaluation employs:
- Unity metrics: Quantifying pitch scatter (variance in $F_0$ across singers) and spectral blend; a metric sketch follows this list.
- Perceptual quality: Mean opinion scores (MOS) reflecting subjective ratings of ensemble coherence.
- Baselines: Comparison against models synthesizing each singer independently (no interaction modeling).
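Pitch scatter in cents can be computed as in the following sketch, which converts $F_0$ to cents against an arbitrary reference and averages the framewise standard deviation across singers; the exact metric definition used in the experiments may differ.

```python
import numpy as np

def pitch_scatter_cents(f0_hz: np.ndarray) -> float:
    """Framewise pitch scatter across singers, in cents.
    f0_hz: (n_singers, frames), voiced frames only (no zeros).
    Returns the time-averaged standard deviation across singers."""
    cents = 1200.0 * np.log2(f0_hz / 440.0)   # reference pitch is arbitrary
    return float(np.std(cents, axis=0).mean())

# Example: three singers, slightly detuned around A4 (440 Hz).
f0 = np.array([[440.0, 441.0], [444.0, 443.0], [437.0, 439.0]])
print(round(pitch_scatter_cents(f0), 1))      # ~8.8 cents of scatter
```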
A representative result table:
| Method | Pitch Scatter (cents) | Perceived Unity (MOS) |
|---|---|---|
| Independent synthesis | 25 | 3.0 |
| With interaction modeling | 12 | 4.2 |
These outcomes indicate that incorporating interaction modeling substantially reduces pitch scatter and improves perceived unity.
7. Significance and Extensions
The DOVSinger System demonstrates that the modeling of inter-singer dependencies is essential for high-fidelity ensemble voice synthesis, capturing aspects of real-world choral performance neglected by solo-based approaches. By leveraging multi-part scores, cross-part conditioning, and interaction-aware losses, DOVSinger advances the perceptual quality and unity of the generated ensemble output. This suggests broad potential for further refinement, including incorporation of more complex adaptive mechanisms, extension to other forms of musical interaction, and application to non-Western ensemble traditions. A plausible implication is the applicability of similar interaction modeling in instrumental music ensemble synthesis and expressive performance rendering.
8. Related Research and Future Directions
Research streams informing and extending DOVSinger include attention-based sequence modeling ("Sinsy"), diffusion and adversarial SVS ("DiffSinger", "Visinger"), and choral ensemble analysis (Chandna et al., Frontiers Signal Process., 2022; Miyazawa et al., Proc. IPSJ SIG-SLP, 2023; Rosenzweig et al., TISMIR, 2020; Nakamura et al., ICASSP, 2023). Future work may address larger and more complex ensembles, diverse musical contexts, real-time interaction simulation, and integration with expressive waveform synthesizers. The role of interaction-aware loss weighting, data efficiency, and fine-grained expressive modeling remains open for systematic investigation.