FuseCodec: Unified Speech Tokenization
- FuseCodec is a neural codec that integrates acoustic, semantic, and contextual features to address limitations in traditional speech tokenization.
- It uses latent fusion, global semantic-contextual supervision, and token-level alignment to improve reconstruction fidelity, intelligibility, and naturalness.
- Applications include zero-shot speech synthesis, enhanced ASR, voice conversion, and multimodal audio interfaces, underscoring its versatility.
FuseCodec is a neural codec architecture designed to unify acoustic, semantic, and contextual representations for speech tokenization. Unlike traditional neural codecs, which predominantly capture low-level acoustic features, FuseCodec integrates linguistic and contextual cues through guided cross-modal alignment and supervision. This approach addresses critical limitations in temporal consistency and representation alignment, facilitating downstream tasks such as speech transcription and synthesis with enhanced accuracy and expressiveness.
1. Motivation and Conceptual Framework
The primary motivation for FuseCodec is to resolve a deficiency in existing speech tokenization methods, which overlook high-level semantic and contextual signals. Prior architectures, such as EnCodec and SoundStream, emphasize acoustic modeling but do not effectively capture linguistic meaning or contextual information. FuseCodec introduces a unified tokenization paradigm that combines three sources of information:
- Acoustic: Features directly extracted from the speech waveform.
- Semantic: Phonetic and linguistic meaning obtained from self-supervised speech models (e.g., HuBERT).
- Contextual: LLM-derived text information (e.g., BERT embeddings on ASR transcriptions).
This integration is achieved via strong cross-modal alignment and globally informed supervision, resulting in token sequences that are not only optimized for reconstruction but also rich enough in content to be leveraged by speech and language models.
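The snippet below is a minimal sketch of how such semantic and contextual signals could be obtained from frozen, off-the-shelf checkpoints: mean-pooled HuBERT frame embeddings for the semantic vector and a BERT [CLS] embedding of a wav2vec 2.0 transcription for the contextual vector. The checkpoint names, greedy CTC decoding, and the omission of feature-extractor preprocessing are simplifying assumptions, not FuseCodec's exact configuration.

```python
import torch
from transformers import (AutoProcessor, AutoTokenizer,
                          BertModel, HubertModel, Wav2Vec2ForCTC)

# Illustrative checkpoints; the paper's exact models are not assumed here.
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
asr_proc = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


@torch.no_grad()
def global_features(wav: torch.Tensor):
    """wav: (1, num_samples) mono waveform at 16 kHz (preprocessing omitted)."""
    # Global semantic vector: mean of HuBERT frame-level embeddings.
    g_sem = hubert(wav).last_hidden_state.mean(dim=1)          # (1, 768)

    # Transcribe with wav2vec 2.0 (greedy CTC decoding).
    ids = asr(wav).logits.argmax(dim=-1)
    text = asr_proc.batch_decode(ids)[0]

    # Global contextual vector: BERT [CLS] embedding of the transcription.
    toks = bert_tok(text, return_tensors="pt")
    g_ctx = bert(**toks).last_hidden_state[:, 0]                # (1, 768)
    return g_sem, g_ctx
```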
2. Principal Techniques and Innovations
FuseCodec introduces three complementary techniques to enrich latent representations and improve tokenization fidelity:
a. Latent Representation Fusion
Semantic and contextual vectors are incorporated directly into the encoder’s latent space. Specifically, global semantic features are calculated by averaging HuBERT frame-level embeddings; global contextual features are extracted as the [CLS] token from BERT, following transcription via wav2vec 2.0. These vectors are broadcast across the length of the latent sequence and fused with latent acoustic features through multi-head cross-attention and additive fusion:
$$\mathbf{z}' = \mathbf{z} + m_s \odot \tilde{\mathbf{s}} + m_c \odot \tilde{\mathbf{c}}$$

where $\tilde{\mathbf{s}}$ and $\tilde{\mathbf{c}}$ denote the cross-attended semantic and contextual signals, $m_s$ and $m_c$ are stochastic dropout masks that mitigate single-modality dominance, and $\odot$ denotes elementwise multiplication.
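Below is a minimal PyTorch sketch of this fusion step, assuming precomputed global semantic and contextual vectors; the hidden dimension, number of attention heads, and drop probability are illustrative choices rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn


class LatentFusion(nn.Module):
    """Sketch of latent representation fusion: each global vector is broadcast
    over time, cross-attended against the acoustic latents, and added back with
    a stochastic mask so that neither modality dominates."""

    def __init__(self, dim: int = 512, num_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.p_drop = p_drop

    def _modality_mask(self, z: torch.Tensor) -> torch.Tensor:
        # Per-sample Bernoulli mask: with probability p_drop the modality is dropped.
        return (torch.rand(z.size(0), 1, 1, device=z.device) > self.p_drop).float()

    def forward(self, z: torch.Tensor, g_sem: torch.Tensor, g_ctx: torch.Tensor) -> torch.Tensor:
        # z:     (B, T, D) acoustic latents from the codec encoder
        # g_sem: (B, D)    global semantic vector (e.g., mean of HuBERT frames)
        # g_ctx: (B, D)    global contextual vector (e.g., BERT [CLS] embedding)
        T = z.size(1)
        sem = g_sem.unsqueeze(1).expand(-1, T, -1)   # broadcast over time
        ctx = g_ctx.unsqueeze(1).expand(-1, T, -1)

        # Cross-attention: acoustic latents query the broadcast global signals.
        s_tilde, _ = self.sem_attn(query=z, key=sem, value=sem)
        c_tilde, _ = self.ctx_attn(query=z, key=ctx, value=ctx)

        # Additive fusion with stochastic modality masks: z' = z + m_s*s~ + m_c*c~.
        return z + self._modality_mask(z) * s_tilde + self._modality_mask(z) * c_tilde
```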
b. Global Semantic-Contextual Supervision
A global distillation loss is applied to the quantized token outputs of the first RVQ layer. Each token projection $\hat{\mathbf{q}}_t$ is supervised with the broadcast global semantic ($\mathbf{g}_s$) and contextual ($\mathbf{g}_c$) signals:

$$\mathcal{L}_{\text{global}} = -\frac{1}{T}\sum_{t=1}^{T}\Big[\log\sigma\big(\cos(\hat{\mathbf{q}}_t, \mathbf{g}_s)\big) + \log\sigma\big(\cos(\hat{\mathbf{q}}_t, \mathbf{g}_c)\big)\Big]$$

where $T$ is the token count, $\cos(\cdot,\cdot)$ denotes cosine similarity, and $\sigma$ is the sigmoid function. This supervision enhances both temporal consistency and cross-modal representational alignment.
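A hedged sketch of this objective in PyTorch is shown below; it treats the continuous projections of the first-layer RVQ tokens and the two global vectors as given, and follows the log-sigmoid-of-cosine form defined above.

```python
import torch
import torch.nn.functional as F


def global_distillation_loss(q1: torch.Tensor,
                             g_sem: torch.Tensor,
                             g_ctx: torch.Tensor) -> torch.Tensor:
    """Sketch of the global semantic-contextual supervision.

    q1:    (B, T, D) continuous projections of the first RVQ layer's tokens
    g_sem: (B, D)    global semantic vector
    g_ctx: (B, D)    global contextual vector
    Returns the mean of -log sigmoid(cosine similarity) over all timesteps.
    """
    sem = g_sem.unsqueeze(1)                            # (B, 1, D), broadcast over T
    ctx = g_ctx.unsqueeze(1)
    cos_sem = F.cosine_similarity(q1, sem, dim=-1)      # (B, T)
    cos_ctx = F.cosine_similarity(q1, ctx, dim=-1)      # (B, T)
    # -log sigma(cos) pulls each token projection toward both global signals.
    return -(F.logsigmoid(cos_sem) + F.logsigmoid(cos_ctx)).mean()
```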
c. Temporally Aligned Contextual Supervision
Token-level supervision leverages fine-grained alignment between contextual embeddings and speech token sequences. To address sequence-length mismatches, a dynamic window-based alignment assigns each BERT contextual embedding to the corresponding RVQ time indices using local cosine similarity. The aligned contextual vectors ($\hat{\mathbf{c}}_t$) inform a distillation loss per timestep:

$$\mathcal{L}_{\text{align}} = -\frac{1}{T}\sum_{t=1}^{T}\log\sigma\big(\cos(\hat{\mathbf{q}}_t, \hat{\mathbf{c}}_t)\big)$$
This method ensures local contextual grounding for each discrete speech token.
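The following sketch illustrates one plausible realization of the window-based alignment and per-timestep distillation; the proportional time-to-text mapping and the window size are assumptions, as the paper's exact alignment rule may differ.

```python
import torch
import torch.nn.functional as F


def aligned_contextual_loss(q1: torch.Tensor,
                            ctx: torch.Tensor,
                            window: int = 2) -> torch.Tensor:
    """Sketch of temporally aligned contextual supervision.

    q1:  (B, T, D) continuous projections of the first RVQ layer's tokens
    ctx: (B, N, D) BERT contextual embeddings (N text positions, typically N != T)
    For each speech timestep, a proportionally mapped text position defines a
    local window; the most cosine-similar contextual embedding in that window
    becomes the aligned target c_hat_t. The window size is an assumed value.
    """
    B, T, _ = q1.shape
    N = ctx.size(1)
    losses = []
    for t in range(T):
        # Proportional (linear) mapping from speech time to text position.
        center = int(round(t * (N - 1) / max(T - 1, 1)))
        lo, hi = max(0, center - window), min(N, center + window + 1)
        cand = ctx[:, lo:hi]                                      # (B, W, D)
        sims = F.cosine_similarity(q1[:, t:t + 1], cand, dim=-1)  # (B, W)
        best = sims.argmax(dim=-1)                                # (B,)
        c_hat = cand[torch.arange(B), best]                       # (B, D) aligned target
        cos_t = F.cosine_similarity(q1[:, t], c_hat, dim=-1)      # (B,)
        losses.append(-F.logsigmoid(cos_t))
    return torch.stack(losses, dim=1).mean()
```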
3. Empirical Performance and Comparative Analysis
FuseCodec achieves superior performance metrics on LibriSpeech compared to established baselines:
| Variant | WER | WIL | STOI | ViSQOL | PESQ | UTMOS | Speaker Sim. |
|---|---|---|---|---|---|---|---|
| FuseCodec-Fusion | 3.99 | 6.45 | 0.95 | 3.47 | 3.13 | – | – |
| FuseCodec-Distill | – | – | – | – | – | 3.65 | 0.996 |
| FuseCodec-ContextAlign | – | – | – | – | – | – | – |
- FuseCodec-Fusion demonstrates the lowest Word Error Rate (WER) and Word Information Loss (WIL), with high intelligibility (STOI 0.95) and perceptual quality (ViSQOL 3.47, PESQ 3.13).
- FuseCodec-Distill excels in naturalness (UTMOS 3.65) and speaker similarity (0.996).
- FuseCodec-ContextAlign balances content preservation and naturalness through token-level alignment.
Baselines such as EnCodec, SpeechTokenizer, and DAC exhibit inferior performance in terms of speech transcription error rates and perceptual metrics due to the lack of high-level cross-modal integration.
4. Applications and Significance
FuseCodec’s advancements extend beyond improved speech reconstruction:
- Zero-Shot Speech Synthesis: FuseCodec-TTS incorporates enriched tokens into a text-to-speech framework, yielding synthesis that maintains speaker identity, prosody, and intelligibility in zero-shot settings.
- Enhanced Downstream Tasks: The unified token space supports speech language modeling, ASR rescoring, voice conversion, and multimodal audio interfaces.
- Guided Tokenization: The fusion of acoustic, semantic, and contextual modalities transitions tokenization from simple compression to linguistically expressive encoding suited for integration with generative LLMs.
A plausible implication is broader applicability in tasks requiring both signal fidelity and content preservation.
5. Technical Composition
Core technical components include:
- Encoder Architecture: The raw waveform $x$ is encoded into a latent representation $\mathbf{z}$.
- Vector Quantization: An 8-layer Residual Vector Quantizer (RVQ) processes $\mathbf{z}$ into discrete tokens $\{q_1, \dots, q_8\}$, with the first-layer tokens $q_1$ used for the guidance (distillation) tasks.
- Pre-trained Model Integration: wav2vec 2.0 (ASR), BERT (contextual), and HuBERT (semantic) are adopted without further fine-tuning.
- Cross-Attention Fusion: Semantic vector fusion is implemented as $\tilde{\mathbf{s}} = \mathrm{CrossAttn}(\mathbf{z}, \mathbf{g}_s)$, with the latent acoustic features as queries and the broadcast global semantic vector as keys and values; an analogous procedure applies to contextual fusion.
- Loss Functions: The final objective aggregates time-domain and frequency-domain reconstruction, adversarial, feature-matching, and commitment losses, along with the auxiliary distillation losses ($\mathcal{L}_{\text{global}}$ and $\mathcal{L}_{\text{align}}$); a rough sketch of this aggregation follows the list.
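As a rough illustration of how these terms could be combined, the sketch below sums the individual losses with placeholder weights; the coefficient values and the implementations of the individual terms are assumptions, not figures reported for FuseCodec.

```python
import torch


def fusecodec_total_loss(losses: dict, weights: dict = None) -> torch.Tensor:
    """Hypothetical aggregation of the training objective.

    `losses` maps term names to scalar tensors computed elsewhere
    (time/frequency reconstruction, adversarial, feature-matching, commitment,
    and the two distillation terms). Default weights are placeholders.
    """
    default = {"time": 1.0, "freq": 1.0, "adv": 1.0, "feat": 1.0,
               "commit": 1.0, "global_distill": 1.0, "align_distill": 1.0}
    weights = weights or default
    # Weighted sum over whichever terms the caller provides.
    return sum(weights[name] * value for name, value in losses.items())
```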
6. Open Research Directions
FuseCodec's methodology invites further exploration in several areas:
- Alignment Algorithm Enhancement: Optimizing dynamic alignment techniques to accommodate variable-length speech and fluctuating token durations.
- Modal Weighting and Expansion: Introducing adaptive weighting schemes and integrating additional modalities (e.g., prosodic or emotional features).
- Scalability and Multilingual Extension: Assessing performance across larger and more diverse datasets, as well as enabling robust cross-modal fusion in multilingual settings.
- Integrative Generative Models: Evaluating the incorporation of enriched tokens into end-to-end generative models for dialog systems and interactive audio applications.
This suggests FuseCodec provides a compelling foundation for future neural codec architectures centered on high-level semantic and contextual integration.