
Codec-Free Modality: Neural Representation & Compression

Updated 2 November 2025
  • Codec-free modality is an approach that uses continuous, learned embeddings instead of discrete codec tokens, enabling unified, end-to-end neural processing.
  • It underpins innovations in multi-modal AI, offering full-duplex speech synthesis and enhanced cross-modal reasoning in systems like SALMONN-omni and UniAudio.
  • Neural compression methods such as VC-INR and CMVC leverage continuous representations to achieve superior bitrate efficiency and fidelity.

Codec-free modality refers to information representation and processing paradigms that eliminate the need for explicit quantization and discrete tokenization schemes typically associated with traditional codecs. In codec-free systems, modality-specific data (speech, audio, video, visual signals) are encoded, transmitted, and decoded using continuous, learned embeddings, neural transform models, or alternative modality bottlenecks. This architectural shift has broad implications for multi-modal AI, compression, generation, and cross-modal reasoning.

1. Principles and Definitions

Codec-free modality is characterized by the absence of discrete quantized tokens obtained from conventional codecs such as EnCodec or SoundStream (for audio/speech) or JPEG, H.264/HEVC, VVC (for visual signals and video). Instead, systems operate on high-dimensional, continuous representations such as:

  • Continuous learned embeddings (e.g., auditory embeddings, neural feature vectors)
  • Neural model states (e.g., weights, gating vectors)
  • Semantic/knowledge bottlenecks (e.g., cross-modal token alignment, text/image bottlenecks)

A system is considered codec-free if all inter-module communication, processing, and generation tasks rely solely on such representations—no codec tokens (quantized, discretized indices) are present anywhere in the processing pipeline.
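
To make the definition concrete, the following minimal PyTorch sketch contrasts the two regimes; all module names, dimensions, and the nearest-neighbor quantizer are illustrative assumptions, not components of any cited system.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the modules are stand-ins, not real systems.
FRAME_DIM, EMBED_DIM, CODEBOOK_SIZE = 80, 256, 1024

encoder = nn.Linear(FRAME_DIM, EMBED_DIM)    # stand-in speech encoder
decoder = nn.Linear(EMBED_DIM, FRAME_DIM)    # stand-in synthesizer
codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)

frames = torch.randn(1, 50, FRAME_DIM)       # 50 input audio frames

# Codec-based: snap each embedding to its nearest codebook entry. The argmin
# yields discrete token indices, a non-differentiable bottleneck.
z = encoder(frames)
dists = torch.cdist(z, codebook.weight.unsqueeze(0))
tokens = dists.argmin(dim=-1)                # quantized codec tokens
recon_codec = decoder(codebook(tokens))

# Codec-free: the continuous embedding flows straight through; no token
# indices exist anywhere, and the whole pipeline stays differentiable.
recon_free = decoder(encoder(frames))
```

The argmin in the codec path is precisely the discrete bottleneck that codec-free systems remove.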

2. Codec-Free Modality in Speech: SALMONN-omni

SALMONN-omni (Yu et al., 27 Nov 2024) exemplifies the codec-free paradigm for full-duplex speech understanding and generation. The architecture consists of:

  • Streaming Speech Encoder (SSE): Transforms input audio (speech, environmental sounds, synthetic echoes) into continuous auditory embeddings in real time, capturing both linguistic and paralinguistic cues.
  • LLM: Processes the embeddings (not tokens), integrating them with text and conversational context to predict word embeddings (as continuous vectors) for downstream response generation.
  • Streaming Speech Synthesizer (SSyn): Consumes LLM word embeddings to synthesize speech waveforms, entirely bypassing quantized codec tokens.

No stage involves mapping audio to discrete tokens. Synchronization between input and output is maintained via explicit time blocks ($\Delta t$), and dialog state is managed by special control tokens (e.g., <start_speak>, <end_speak>), embedded as part of the model's state vector, not as codec tokens.
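
As a rough illustration of the time-block idea, the sketch below chops a waveform into fixed-duration blocks; the block duration and single-buffer handling are assumptions for exposition, not SALMONN-omni's actual scheduling.

```python
import torch

SAMPLE_RATE = 16_000
DELTA_T = 0.08                              # assumed block duration (seconds)
BLOCK = int(SAMPLE_RATE * DELTA_T)          # samples per time block

def stream_blocks(waveform: torch.Tensor):
    """Yield fixed-duration blocks so encoding, LLM state updates, and
    synthesis can advance in lockstep on one shared time grid."""
    for start in range(0, waveform.numel() - BLOCK + 1, BLOCK):
        yield waveform[start:start + BLOCK]

audio = torch.randn(2 * SAMPLE_RATE)        # two seconds of dummy audio
for block in stream_blocks(audio):
    # In a full-duplex system each block is encoded to continuous embeddings
    # while the response to earlier blocks is still being synthesized.
    assert block.numel() == BLOCK
```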

Full-duplex capability—with simultaneous listening (speech input), thinking (internal state transitions), and speaking (speech generation)—is achieved using an asynchronous update schedule, mediated by a composite objective:

\mathcal{L} = \lambda_\text{text}\mathcal{L}_\text{text} + \lambda_\text{speech}\mathcal{L}_\text{speech} + \lambda_\text{think}\mathcal{L}_\text{think}

where the speech and text losses operate purely over embeddings, and the "thinking" mechanism regulates non-speaking behavior.
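
A minimal sketch of such a composite objective is below; the use of mean-squared error for the embedding terms and cross-entropy for the dialogue-state term is an assumption for illustration, not SALMONN-omni's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_text, tgt_text, pred_speech, tgt_speech,
                   state_logits, state_labels,
                   lam_text=1.0, lam_speech=1.0, lam_think=0.5):
    """Weighted sum of the three terms; the speech and text losses act on
    continuous embeddings, with no codec-token targets anywhere."""
    l_text = F.mse_loss(pred_text, tgt_text)
    l_speech = F.mse_loss(pred_speech, tgt_speech)
    l_think = F.cross_entropy(state_logits, state_labels)  # listen/think/speak
    return lam_text * l_text + lam_speech * l_speech + lam_think * l_think

loss = composite_loss(torch.randn(8, 256), torch.randn(8, 256),
                      torch.randn(8, 256), torch.randn(8, 256),
                      torch.randn(8, 3), torch.randint(0, 3, (8,)))
```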

3. Codec-Free and Alternative Modality Representation in Visual and Video Coding

Alternative codec-free approaches in visual information representation transfer the compression or enhancement role from pixel-wise or residual-wise coding to learned neural models or modality bottlenecks.

  • In "Towards Modality Transferable Visual Information Representation with Optimal Model Compression" (Lin et al., 2020), an enhancement layer consists of a highly compressed, scene-specialized DNN (based on SEFCNN), transmitted in lieu of conventional transform-coded residuals. Model weights are quantized during training and further compressed through entropy coding and transmission of residuals only:

W_{Q} = \text{round}(W_{L} \times S_{c}), \quad W_{\text{conv}} = f_{\varphi}\left( W_{Q} \times \frac{1}{S_c} \right)

The enhancement layer enables codec-free reconstruction or improvement by supplying model knowledge rather than signal residuals.
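
The round trip in the equation above amounts to scaling, rounding, and rescaling the enhancement-layer weights; a minimal sketch follows (the 8-bit-style scale choice is an assumption, and the paper's entropy coding of residuals is omitted).

```python
import torch

def quantize(w_learned: torch.Tensor, s_c: float) -> torch.Tensor:
    """W_Q = round(W_L * S_c): snap float weights to integer levels."""
    return torch.round(w_learned * s_c)

def dequantize(w_q: torch.Tensor, s_c: float) -> torch.Tensor:
    """Recover approximate weights, W_Q / S_c, fed to the conv f_phi."""
    return w_q / s_c

w = torch.randn(64, 64)
s_c = 127.0 / w.abs().max().item()   # assumed scale: map max weight to 127
w_hat = dequantize(quantize(w, s_c), s_c)
max_err = (w - w_hat).abs().max()    # error bounded by 1 / (2 * S_c)
```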

  • The CMVC paradigm (Zhang et al., 15 Aug 2024) for video coding reframes video compression as a cross-modal generation task. Keyframes (spatial content) are encoded as text/image, and motion as text or other semantic descriptors. At decoding, generative multimodal LLMs reconstruct the video content (text-to-video, image-text-to-video), circumventing any conventional codec bitstream. Quantitative and qualitative benchmarks demonstrate competitive reconstruction at ultra-low (ULB) and extreme-low (ELB) bitrates, with semantic fidelity in the TT2V mode and perceptual quality in the IT2V mode, without conventional signal-based codecs.
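
A conceptual sketch of this flow is below; every function body is a trivial stub standing in for a real captioner or generative video model, and the motion description is invented for illustration.

```python
from typing import List, Tuple

def caption(keyframe: str) -> str:
    return f"caption of {keyframe}"                     # stub captioner

def text_to_video(prompt: str) -> List[str]:
    return [f"frame from text: {prompt}"]               # stub TT2V generator

def image_text_to_video(keyframe: str, prompt: str) -> List[str]:
    return [f"frame from {keyframe} + {prompt}"]        # stub IT2V generator

def encode(frames: List[str]) -> Tuple[str, str]:
    """The transmitted payload is just a keyframe plus semantic motion text."""
    return frames[0], "camera pans left; subject walks forward"

def decode(keyframe: str, motion: str, mode: str = "IT2V") -> List[str]:
    if mode == "TT2V":   # text-only path: semantic fidelity at lowest rate
        return text_to_video(caption(keyframe) + "; " + motion)
    return image_text_to_video(keyframe, motion)        # perceptual quality

video = decode(*encode(["kf0", "mid", "kf1"]))
```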

4. Neural Compression and Modality-Agnostic Architectures

Modality-agnostic codec-free approaches have emerged in the context of neural compression and implicit neural representations (INRs). VC-INR (Schwarz et al., 2023) deploys a functional view of data: each signal is encoded as the parameters and gating masks of a shared neural representation, with no conventional codec at any stage. The process involves:

  • Meta-learned latent codes ($\phi$) are mapped to soft, low-rank gating masks for subnetwork selection:

G^{(l)}_{\mathrm{low}} := \sigma\left(\mathbf{a}^{(l)}\, \mathbf{b}^{(l)\top}\right)

  • Learned transforms compress and entropy code the latent ($\hat{a}$), jointly optimizing the rate-distortion objective.

This enables direct neural model-based compression and reconstruction for images, audio, video, 3D shapes, and climate data—yielding state-of-the-art or superior bitrate vs. fidelity performance compared to JPEG 2000, MP3, AVC/HEVC, etc.
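
The gating construction above can be sketched in a few lines; the rank, shapes, and random factors here are illustrative (in VC-INR the factors are produced from the meta-learned latent $\phi$, not sampled).

```python
import torch

def low_rank_gate(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """G = sigmoid(a b^T): a soft, low-rank mask that selects a subnetwork
    of the shared representation for one particular signal."""
    return torch.sigmoid(a @ b.T)

a = torch.randn(128, 4)             # rank-4 factors for a 128x128 layer
b = torch.randn(128, 4)
gate = low_rank_gate(a, b)          # (128, 128), entries in (0, 1)

shared_w = torch.randn(128, 128)    # one layer of the shared INR
modulated_w = gate * shared_w       # per-signal modulation, fully continuous
```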

5. Semantic Alignment and Cross-modal Token Spaces

Recent advances in speech and audio representation leverage cross-modal alignment of semantic tokens for codec-free or codec-minimal systems.

  • SecoustiCodec (Qiang et al., 4 Aug 2025) enforces a single-codebook space for semantic tokens via VAE+FSQ quantization and contrastive frame-level alignment between speech and text. By disentangling the semantic ($S$) and paralinguistic ($G$) components of the acoustic representation ($A$),

S + G \approx A

semantic tokens can be consumed directly by LLMs or generation models, supporting TTS, ASR, and dialogue with minimal codec dependency. Paralinguistics are supplied as side-channel continuous variables, not quantized in the codebook.
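
SecoustiCodec's FSQ stage follows the general finite-scalar-quantization recipe; a generic minimal sketch is below, with the number of levels and the tanh bounding chosen for illustration rather than taken from the paper.

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Finite scalar quantization: bound each dimension, round it to one of
    `levels` values, and pass gradients straight through the rounding."""
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half                 # each dim in [-half, half]
    quantized = torch.round(bounded)
    # straight-through estimator: forward uses the rounded value,
    # backward treats the op as identity, keeping the encoder trainable
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 16, requires_grad=True)
q = fsq(z)
q.sum().backward()                  # gradients flow despite round()
```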

  • UniAudio 1.5 (Yang et al., 14 Jun 2024) uses a multi-scale residual VQ scheme that directly maps audio features to token sequences in the LLM’s vocabulary. This permits a frozen LLM to treat audio as a "foreign language," supporting in-context few-shot learning and cross-modal reasoning—no fine-tuning or modality-adapter required. The total number of tokens per second for audio is dramatically reduced, and prompt-based interfaces enable both understanding and generation tasks with competitive reconstruction fidelity.
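
A minimal residual-VQ sketch is below; the stage count, dimensions, and random codebooks are illustrative, whereas UniAudio 1.5's LLM-Codec ties its codebooks to the frozen LLM's vocabulary embeddings.

```python
import torch

def residual_vq(z: torch.Tensor, codebooks) -> list:
    """Multi-stage residual VQ: each stage quantizes what previous stages
    left behind, emitting one token index per frame per stage."""
    tokens, residual = [], z
    for cb in codebooks:                      # cb: (K, D) codebook matrix
        dists = torch.cdist(residual, cb)     # (T, K) distances
        idx = dists.argmin(dim=-1)            # discrete token per frame
        tokens.append(idx)
        residual = residual - cb[idx]         # pass the residual onward
    return tokens

D, K = 64, 256
codebooks = [torch.randn(K, D) for _ in range(3)]   # 3 scales/stages
audio_feats = torch.randn(50, D)                    # 50 audio frames
token_streams = residual_vq(audio_feats, codebooks) # 3 index sequences
```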

6. Comparative Perspectives and Impact

Codec-free modality offers several well-documented advantages:

  • End-to-end differentiability and trainability: No non-differentiable quantization bottlenecks, facilitating joint module optimization.
  • Richness of representation: Continuous embeddings can preserve prosodic, paralinguistic, or semantic nuances that are degraded or lost in quantized systems.
  • Unified token space: Cross-modal reasoning, transfer, and in-context learning are enabled, reducing modality heterogeneity.
  • Low latency and flexibility: Streaming operation is suited to real-time scenarios (SALMONN-omni), with asynchronous pipelines, barge-in, echo cancellation, and other advanced conversational features.
  • Scalability and extensibility: Modality-agnostic neural compression (VC-INR) and generative cross-modal coding (CMVC, UniAudio) generalize to diverse data domains.

Potential challenges include increased computational and data requirements, reduced interpretability of continuous embeddings, and legacy system compatibility.

7. Summary Table: Core Implementations and Modalities

| System / Paradigm | Modality | Codec-Free Mechanism | Bitrate Efficiency / Results |
|---|---|---|---|
| SALMONN-omni (Yu et al., 27 Nov 2024) | Speech | Embedding-based streaming, full-duplex | Turn-taking, barge-in, echo cancellation; unified speech/text |
| VC-INR (Schwarz et al., 2023) | Image/Audio/Video | INR + latent-based compression | Outperforms JPEG 2000, MP3, AVC/HEVC |
| SecoustiCodec (Qiang et al., 4 Aug 2025) | Speech/Text | Single-codebook semantic quantization | SOTA PESQ 1.77/2.58 @ 0.27/1 kbps; 98% codebook utilization |
| UniAudio 1.5 (Yang et al., 14 Jun 2024) | Speech/Audio | LLM-Codec with LLM-vocab tokenization | 57 tokens/s; competitive PESQ/STOI; few-shot generalization |
| CMVC (Zhang et al., 15 Aug 2024) | Video | Text/image semantic bottleneck | TT2V/IT2V outperform VVC at ULB/ELB; −65–90% BD-rate |

Codec-free modality enables unified, extensible, and semantically rich processing, compression, and generation for diverse data types, facilitating progress in real-time human-computer interaction, information broadcasting, and flexible cross-modal AI systems.
