DAC-SE1: High-Res Speech Enhancement
- DAC-SE1 is a unified high-resolution speech enhancement framework that maps discrete noisy DAC tokens to clean output using a single-stage autoregressive transformer model.
- It employs a DAC codec with 9 RVQ codebooks at 44.1 kHz, preserving both global semantic content and fine-grained acoustic details through efficient tokenization.
- The model surpasses previous methods as evidenced by superior DNSMOS, PESQ, and MUSHRA scores, demonstrating robust multitask performance on diverse distortions.
DAC-SE1 is a unified, high-resolution speech enhancement framework that applies a single-stage LLM to map discrete, high-fidelity audio tokens representing noisy input directly into clean output. Leveraging recent advances in neural audio codecs and autoregressive transformers, DAC-SE1 is designed for scalable, domain-general speech enhancement, preserving both fine-grained acoustic detail and semantic coherence without reliance on low-sample-rate or semantic-only tokenizations. The framework surpasses prior autoregressive speech enhancement approaches in both objective (DNSMOS, PESQ, SpeechBERTScore) and subjective (MUSHRA) metrics and is structured to facilitate reproducibility and extensibility through released code and models (Lanzendörfer et al., 2 Oct 2025).
1. Architecture and Tokenization
DAC-SE1 encodes audio using the high-resolution DAC codec, which employs residual vector quantization (RVQ) to generate a stack of 9 codebook layers, each with 1024 symbols, at 44.1 kHz. The codebooks are flattened into a single time-major token sequence, yielding roughly 775 tokens per second of audio (9 codebooks at the codec's ~86 Hz frame rate). This design ensures each token sequence encodes both global and subtle local waveform structure, enabling preservation of speech content and timbral nuances.
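As a concrete illustration, the sketch below flattens a (9, T) grid of RVQ codes into one time-major stream. The per-codebook id offsets are an assumed convention for giving each codebook its own vocabulary range; the paper's exact token-id scheme may differ.

```python
import numpy as np

def flatten_dac_codes(codes: np.ndarray, codebook_size: int = 1024) -> np.ndarray:
    """Flatten (n_codebooks, n_frames) RVQ codes into one time-major token stream.

    Assumption: codebook k is shifted into its own id range
    [k * codebook_size, (k + 1) * codebook_size), so the LM sees a single
    shared vocabulary of 9 * 1024 = 9216 audio tokens.
    """
    n_q, _ = codes.shape                        # e.g. (9, frames) from the DAC encoder
    offsets = (np.arange(n_q) * codebook_size)[:, None]
    return (codes + offsets).T.reshape(-1)      # frame-major: t0q0..t0q8, t1q0..t1q8, ...
```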
The enhancement model is a causally masked, transformer-based LLM with a LLaMA backbone (1 billion parameters, 24 layers, hidden size 1536, 24 attention heads, feedforward dimension 6144). Rotary positional embeddings (RoPE) with a high base value (θ = 100,000) enable efficient modeling of long-range dependencies for sequences up to 8192 tokens.
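These hyperparameters map directly onto a standard Hugging Face `LlamaConfig`. The sketch below is an assumed reconstruction, not the released configuration; the vocabulary size in particular is a guess based on 9 × 1024 codes plus special tokens.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=9 * 1024 + 2,          # assumption: 9216 audio tokens + boundary/pad specials
    hidden_size=1536,
    num_hidden_layers=24,
    num_attention_heads=24,
    intermediate_size=6144,           # feedforward dimension
    max_position_embeddings=8192,     # maximum sequence length
    rope_theta=100_000.0,             # high RoPE base for long-range modeling
)
model = LlamaForCausalLM(config)      # roughly 1B parameters with these dimensions
```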
2. Single-Stage Enhancement Process
In training and inference, the model receives a concatenated sequence of the form `[noisy tokens] ⟨start-clean⟩ [clean tokens]`. The ⟨start-clean⟩ boundary token signals the transition from degraded to reference conditions. The transformer learns to autoregressively predict clean tokens conditioned on the preceding noisy context. This single-stage mapping contrasts with prior multi-stage pipelines, which often rely on separate semantic extraction, auxiliary encoders, or operate only at low (e.g., 16 kHz) sample rates.
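In this formulation, inference reduces to standard causal decoding. A minimal sketch, assuming a Hugging Face-style causal LM and a hypothetical `start_clean_id` special token:

```python
import torch

@torch.no_grad()
def enhance(model, noisy_tokens: torch.LongTensor, start_clean_id: int, max_new: int):
    """Greedy autoregressive decoding of clean tokens given the noisy prefix."""
    prompt = torch.cat([noisy_tokens, torch.tensor([start_clean_id])]).unsqueeze(0)
    out = model.generate(prompt, max_new_tokens=max_new, do_sample=False)
    return out[0, prompt.shape[1]:]   # the predicted clean-token continuation
```

This is a sketch only (no sampling strategy, chunking, or KV-cache tuning); the decoded tokens would then be un-flattened and passed through the DAC decoder to reconstruct audio.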
By modeling enhancement entirely within a token-to-token framework, DAC-SE1 simplifies the architecture, minimizes explicit domain assumptions, and empowers scaling via data and model size rather than custom task structuring.
3. Preservation of Semantic and Acoustic Structure
Unlike codecs or enhancement pipelines that operate at a semantic or coarse spectral level (e.g., HuBERT, WavLM), the use of tokenized DAC representations at 44.1 kHz allows the network to reconstruct both intelligibility and signal fidelity. The causal transformer, via its attention mechanisms, models both long-range dependencies (preserving global semantic content) and local token patterns (preserving fine-grained acoustic detail).
The training objective is to minimize prediction error for clean tokens, enabling the model to produce output that is both free of noise/distortion and faithful to linguistic content, as validated by both objective and human evaluation.
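Concretely, this objective is a next-token cross-entropy restricted to the clean segment; the noisy prefix serves only as context. A hedged sketch using the standard Hugging Face label-masking convention (the boundary-token id is hypothetical):

```python
import torch

def clean_token_loss(model, noisy: torch.LongTensor, clean: torch.LongTensor,
                     start_clean_id: int) -> torch.Tensor:
    """Cross-entropy on clean tokens only; noisy prefix and boundary are masked out."""
    boundary = torch.full((noisy.shape[0], 1), start_clean_id, dtype=torch.long)
    inputs = torch.cat([noisy, boundary, clean], dim=1)
    labels = inputs.clone()
    labels[:, : noisy.shape[1] + 1] = -100    # -100 is ignored by the CE loss
    return model(input_ids=inputs, labels=labels).loss
```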
4. Training Regimen and Multitask Enhancement
The model is trained in two stages. Stage one is multitask: the model learns mappings for a variety of distortions in parallel (additive noise, reverberation, downsampling, packet loss). Stage two fine-tunes the model for each distortion type individually to ensure balanced loss contributions and robust performance across conditions.
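A minimal sketch of the stage-one sampling logic, with the degradation functions left as assumed helpers (the paper's actual distortion parameters and mixing ratios are not specified here):

```python
import random

DISTORTIONS = ["additive_noise", "reverberation", "downsampling", "packet_loss"]

def stage1_example(clean_wav, degrade: dict):
    """Stage 1: sample one distortion per training example so all tasks are
    learned in parallel; stage 2 fixes `kind` to a single distortion type."""
    kind = random.choice(DISTORTIONS)
    noisy_wav = degrade[kind](clean_wav)   # assumed per-distortion helper
    return noisy_wav, clean_wav            # (input, target) pair before tokenization
```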
The training corpus encompasses over 5 billion tokens, with training conducted on distributed H200 GPUs over 12 hours. This large-scale training enables generalization and high-capacity mapping from diverse noisy input conditions to high-fidelity output.
5. Objective and Subjective Evaluation
DAC-SE1 surpasses previous autoregressive and classical speech enhancement methods (e.g., LLaSE-G1, VoiceFixer) on a range of evaluation metrics. Across benchmark datasets (such as HiFiTTS-2):
- Objective scores: Higher DNSMOS (OVRL/SIG/BAK), PESQ, and PLCMOS values. Improved Whisper-based WER, indicating maintained or improved speech intelligibility in automatic speech recognition tasks.
- Subjective scores: MUSHRA listening tests showed strong listener preference for DAC-SE1 outputs, demonstrating the model's ability to enhance perceived naturalness and intelligibility without introducing artifacts.
These results reflect both the high-fidelity nature of the DAC codec tokenization and the effectiveness of the single-stage transformer architecture in preserving signal quality.
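As a concrete example of the objective evaluation, the sketch below computes wideband PESQ for 44.1 kHz outputs, assuming the open-source `pesq` package; PESQ itself is defined at 8/16 kHz, so the signals are resampled first.

```python
import numpy as np
from pesq import pesq                       # pip install pesq
from scipy.signal import resample_poly

def wideband_pesq(ref_44k: np.ndarray, enh_44k: np.ndarray) -> float:
    """Wideband PESQ between a clean reference and an enhanced signal."""
    ref = resample_poly(ref_44k, 160, 441)  # 44.1 kHz -> 16 kHz
    enh = resample_poly(enh_44k, 160, 441)
    return pesq(16000, ref, enh, "wb")
```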
6. Technical Features and Implementation Details
| Component | Detail | Role/Significance |
|---|---|---|
| Codec | DAC, 9 RVQ codebooks @1024 codes, 44.1 kHz | Dense, high-fidelity tokenization |
| Model | LLaMA-based transformer, 1B params, 24 layers | Autoregressive mapping from noisy to clean tokens |
| Sequence Handling | 8192 token max, RoPE θ=100,000, causal masking | Long-range, efficient contextual modeling |
| Boundary token | "start-clean" | Marks transition in token sequence |
| Training (stage 1/2) | Multitask / Distortion-specific fine-tuning | Robustness and balanced enhancement across tasks |
| Data/Compute | 5B+ tokens, H200 GPUs, 12 hours | Supports large-scale, diverse training |
The fully discrete sequence format and reliance on standard causal transformers facilitate deployment and reproducibility. Released codebases and checkpoints further support community benchmarking and extension.
7. Future Research and Extensions
The DAC-SE1 paradigm suggests several research trajectories:
- Scaling: Larger model sizes and more diverse data may improve generalization and signal quality, potentially matching studio-grade restoration on arbitrary speech content.
- Unified Audio Transformation: A single autoregressive LM can handle multiple speech enhancement tasks (noise, reverberation, packet loss), indicating the potential for unified audio restoration and transformation frameworks.
- Tokenization Advances: Exploring alternative or adaptive tokenization strategies could further improve the preservation of spectral and prosodic cues in challenging real-world signals.
- Application Domains: The architecture may be adapted to related tasks such as high-fidelity music enhancement, audio super-resolution, or multimodal speech/audio generation.
A plausible implication is that scalable, generalist LMs operating directly on discrete, high-resolution audio tokens could obviate bespoke multi-stage pipelines for speech enhancement. The open release of code and checkpoints is positioned to accelerate research in this direction (Lanzendörfer et al., 2 Oct 2025).