Codec-Free End-to-End Models
- Codec-free end-to-end models are machine learning systems that map raw inputs directly to outputs by jointly optimizing encoding, compression, and task objectives without intermediate codebooks.
- They unify encoding, compression, and task processing across domains such as image, audio, video, and language, simplifying workflows and often outperforming traditional modular pipelines.
- These models achieve improved rate–distortion and task-specific metrics while posing challenges in computational efficiency, memory usage, and interpretability.
A codec-free end-to-end model is a machine learning system in which the signal representation, compression, and task objectives are integrated within a single, jointly trained network, fully replacing the hand-designed codec or tokenization pipeline. These models dispense with explicit intermediate codebooks, quantized speech/audio/text/image tokens, or fixed standards, instead learning their own latent or continuous embedding spaces directly optimized for the downstream objectives—whether rate–distortion, semantic preservation, or multi-modal reasoning. Their core characteristic is the elimination of manual decomposition into discrete atomic units (“codecs” in speech/image/video or “tokens” in NLP), enabling direct optimization over raw inputs. This paradigm is realized across imaging, speech, video, language, and communications, giving rise to highly expressive and adaptive systems with streamlined workflows and, in many cases, superior performance versus modular pipelines.
1. Definition and Rationale
Codec-free end-to-end models are distinguished by their direct mapping from raw input (pixels, waveforms, bytes) to output (reconstructed signals, task predictions) via a differentiable deep architecture, with no discrete codebook or tokenization step between encoding and decoding. For language, this means mapping directly from bytes to byte distributions without intermediate tokens (Belouadi et al., 2022). In audio/image/video compression, the classic signal → codec → bits chain is replaced by neural networks that learn bottleneck representations and quantizers in a task-driven, end-to-end fashion (Jia et al., 24 Nov 2025, Kankanahalli, 2017, Zhang et al., 16 Jan 2024, Zou et al., 2020, Chen et al., 2020). For neural wireless communication, both the modulation/demodulation and equalization/detection pipelines can be dissolved into a single, end-to-end learned chain operating directly in a continuous space (Cheng et al., 29 Oct 2025). In emerging multi-modal LLMs for speech/vision, codec-free means eschewing quantized audio or visual tokens and operating over continuous learned embeddings throughout all modality transformations (Yu et al., 27 Nov 2024).
Motivations include: avoiding suboptimal, human-imposed bottlenecks; universality across domains and languages; sharply reduced task-specific preprocessing and hand-engineering; and direct, differentiable optimization for perceptual or semantic objectives.
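To make the definition concrete, the minimal sketch below maps raw audio to raw audio with only a continuous learned bottleneck in between, and no codebook or tokenizer anywhere in the chain. It is illustrative only; the layer shapes, dimensions, and loss are placeholder assumptions, not the architecture of any cited system.

```python
import torch
import torch.nn as nn

class CodecFreeAutoencoder(nn.Module):
    """Illustrative codec-free pipeline: raw waveform in, raw waveform out.

    No discrete codebook sits between encoder and decoder; the bottleneck
    is a continuous embedding learned jointly with the end-to-end loss.
    """

    def __init__(self, bottleneck_dim: int = 64):
        super().__init__()
        # Encoder: strided 1-D convolutions map raw samples to a latent sequence.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.GELU(),
            nn.Conv1d(32, bottleneck_dim, kernel_size=9, stride=4, padding=4),
        )
        # Decoder: transposed convolutions reconstruct the waveform.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(bottleneck_dim, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)   # continuous latent: no quantization, no tokens
        return self.decoder(z)

x = torch.randn(8, 1, 16384)                  # batch of raw audio segments
model = CodecFreeAutoencoder()
loss = nn.functional.mse_loss(model(x), x)    # end-to-end differentiable objective
loss.backward()
```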
2. Architectures and Methodologies
2.1 End-to-End Compression and Transmission
In image compression, codec-free end-to-end pipelines consist of learnable encoders, low-dimensional bottleneck layers (often with differentiable quantization or latent VQ), and decoders, all trained against a joint rate–distortion or rate–perception loss (Jia et al., 24 Nov 2025, Zhang et al., 16 Jan 2024, Chamain et al., 2020). For speech, convolutional or cascaded residual autoencoders replace the entire psychoacoustic feature stack, waveform quantizer, and entropy coder (Kankanahalli, 2017, Zhen et al., 2019).
For video, latent-difference autoencoders with self-attention, or recurrent autoencoders operating on displaced residuals, learn to exploit temporal structure without explicit motion estimation or compensation (Zou et al., 2020, Chen et al., 2020). In wireless communication, neural transceivers are built as joint transmitter–receiver CNN/ResNet chains, with learnable constellation mapping, soft bit-likelihood output, and neural equalization, fully abolishing conventional block-encoded modulation and pilot-aided channel estimation (Cheng et al., 29 Oct 2025).
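The compression pipelines above can be sketched as follows, assuming PyTorch, a toy convolutional transform, the common additive-uniform-noise quantization proxy, and a factorized Gaussian entropy model as a stand-in for the cited papers' richer entropy models; all layer sizes and the trade-off weight are assumptions, not any cited system's configuration.

```python
import torch
import torch.nn as nn

class ToyImageCodec(nn.Module):
    """Minimal learned image codec: encoder -> quantized latent -> decoder."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )
        # Learned scale of a factorized Gaussian entropy model (a placeholder
        # for the learned entropy models used in the cited work).
        self.log_scale = nn.Parameter(torch.zeros(ch, 1, 1))

    def forward(self, x):
        y = self.enc(x)
        # Training-time quantization proxy: additive uniform noise in [-0.5, 0.5).
        y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)
        # Bits per latent element under the Gaussian prior (rate term R).
        prior = torch.distributions.Normal(0.0, self.log_scale.exp())
        p_bin = prior.cdf(y_hat + 0.5) - prior.cdf(y_hat - 0.5)
        rate = -torch.log2(p_bin.clamp_min(1e-9)).mean()
        return self.dec(y_hat), rate

x = torch.rand(2, 3, 64, 64)
x_hat, rate = ToyImageCodec()(x)
lam = 0.01                                             # rate-distortion trade-off
loss = nn.functional.mse_loss(x_hat, x) + lam * rate   # L = D + lambda * R
loss.backward()
```

At inference time, hard rounding and an entropy coder driven by the same learned prior would replace the noise proxy.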
2.2 Token-Free NLP
Codec-free LLMs (e.g., ByGPT) discard tokenization and operate on raw UTF-8 byte sequences, mapping directly from bytes to bytes using standard transformer or decoder-only architectures (Belouadi et al., 2022). These architectures are typically initialized from large-scale byte-level pretraining and adapt to task-specific structure via control symbols embedded as additional byte codes.
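A minimal byte-level decoder sketch (illustrative hyperparameters; not ByGPT's actual configuration) shows how the "tokenizer" reduces to UTF-8 encoding and a 256-entry embedding table plus a few control symbols:

```python
import torch
import torch.nn as nn

# Byte-level "tokenization": the identity map over UTF-8 bytes.
# The vocabulary is fixed at 256 (plus control codes), for any language.
def bytes_to_ids(text: str) -> torch.Tensor:
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

class ByteLM(nn.Module):
    """Decoder-only next-byte predictor (illustrative sketch)."""

    def __init__(self, d_model: int = 256, n_layers: int = 4, n_ctrl: int = 4):
        super().__init__()
        self.embed = nn.Embedding(256 + n_ctrl, d_model)  # bytes + control symbols
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 256 + n_ctrl)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        sz = ids.size(-1)
        # Causal mask so each position sees only earlier bytes.
        mask = torch.triu(
            torch.full((sz, sz), float("-inf"), device=ids.device), diagonal=1
        )
        h = self.blocks(self.embed(ids), mask=mask)
        return self.head(h)  # logits over the next byte

ids = bytes_to_ids("Grüße!").unsqueeze(0)   # multibyte UTF-8, no tokenizer needed
logits = ByteLM()(ids)
loss = nn.functional.cross_entropy(
    logits[:, :-1].flatten(0, 1), ids[:, 1:].flatten()
)
```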
2.3 Multi-Modal and Streaming Architectures
Codec-free LLMs for speech understanding/generation (e.g., SALMONN-omni) employ raw audio encoders to produce continuous embeddings, use transformer-based architectures to jointly reason over audio and linguistic modalities, and synthesize waveforms from output embeddings via streaming vocoders. There is no representation of the signal as quantized code vectors or tokens at any stage (Yu et al., 27 Nov 2024).
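A hedged sketch of such a codec-free speech front-end follows; module names and dimensions are assumed for illustration, and SALMONN-omni's actual encoder and projection differ.

```python
import torch
import torch.nn as nn

class ContinuousAudioFrontend(nn.Module):
    """Codec-free speech interface for an LLM (illustrative sketch).

    Raw audio is encoded to continuous frame embeddings and linearly
    projected into the LLM's embedding space; no vector quantization or
    discrete audio tokens appear anywhere in the chain.
    """

    def __init__(self, llm_dim: int = 1024, frame_dim: int = 512):
        super().__init__()
        self.audio_encoder = nn.Sequential(  # stand-in for a pretrained encoder
            nn.Conv1d(1, frame_dim, kernel_size=400, stride=320),  # ~20 ms hop
            nn.GELU(),
        )
        self.project = nn.Linear(frame_dim, llm_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = self.audio_encoder(wav).transpose(1, 2)  # (B, T, frame_dim)
        return self.project(frames)                       # (B, T, llm_dim)

# Continuous audio embeddings are concatenated with text embeddings and fed
# to the LLM backbone; output embeddings go straight to a streaming vocoder.
wav = torch.randn(1, 1, 32000)  # 2 s of 16 kHz audio
audio_emb = ContinuousAudioFrontend()(wav)
```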
3. Training Paradigms and Objectives
All codec-free end-to-end systems are driven by joint objectives encompassing compression rate, signal fidelity, perceptual quality, and, in many cases, task-specific or cross-modal supervision. The learning framework typically requires custom differentiable proxies for otherwise discrete or non-differentiable stages (quantization, entropy coding, symbol assignment), usually leveraging soft assignments and straight-through estimators (see the sketch after the list below). Representative objective forms include:
- Image/audio: $\mathcal{L} = \mathcal{D} + \lambda \mathcal{R}$, with $\mathcal{D}$ a distortion or perceptual loss, $\mathcal{R}$ a rate/entropy penalty, and possibly adversarial or feature-space matching losses (Jia et al., 24 Nov 2025, Kankanahalli, 2017, Zhang et al., 16 Jan 2024).
- Video: multi-term rate–distortion including temporal dependencies (e.g., LSTM-driven, or attention-based fusion of sequential embeddings) (Zou et al., 2020, Chen et al., 2020).
- Language: byte-level (or char-level) cross-entropy over the full input sequence (Belouadi et al., 2022).
- Wireless: cross-entropy over bit log-likelihood ratios (LLRs), rate constraints, and physical-layer metrics, e.g., peak-to-average power ratio (PAPR) compliance (Cheng et al., 29 Oct 2025).
- Multi-modal: combined text, speech, and “thinking” losses with asynchronous cross-modal scheduling (Yu et al., 27 Nov 2024).
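The straight-through estimator mentioned above can be written in a few lines; this is a generic sketch of the technique, not any cited system's quantizer:

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Straight-through estimator: hard rounding forward, identity gradient back.

    This is the common differentiable proxy that lets end-to-end training
    pass gradients through an otherwise non-differentiable quantizer.
    """

    @staticmethod
    def forward(ctx, y: torch.Tensor) -> torch.Tensor:
        return torch.round(y)   # true quantization in the forward pass

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        return grad_out         # pretend rounding is the identity

y = torch.randn(4, requires_grad=True)
q = RoundSTE.apply(y)           # integers forward, smooth gradients backward
q.sum().backward()
assert torch.equal(y.grad, torch.ones_like(y))
```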
4. Applications and Comparative Results
Codec-free end-to-end models have demonstrated strong empirical performance, often eclipsing traditional codecs or hybrid pipelines in both objective and subjective metrics.
Image compression: CoD attains a PSNR BD-rate within −2.1% of VTM (the H.266/VVC reference codec) while delivering better perceptual quality (FID, DISTS) at ultra-low bitrates (e.g., 0.0039 bpp) than GAN codecs and prior diffusion codecs (Jia et al., 24 Nov 2025). Frequency-oriented models can outperform all classical codecs, including VVC, on MS-SSIM, and preserve high task accuracy for detection and segmentation (Zhang et al., 16 Jan 2024, Chamain et al., 2020).
Speech coding: DNN-based systems match or exceed AMR-WB and even OPUS in objective SNR and PESQ at comparable bitrates with far fewer parameters (Kankanahalli, 2017, Zhen et al., 2019).
Video: Models like MOVI-Codec, which forgo motion estimation entirely in favor of spatio-temporal latent residuals, outperform H.264/AVC, HEVC/H.265, and H.266/VVC in MS-SSIM at higher bitrates (Chen et al., 2020). Self-attention–driven models with learned masks approach VVC rate–distortion curves (Zou et al., 2020).
Token-free NLP: Byte-level LMs such as ByGPT excel at producing and controlling character-level features (rhyme, alliteration) unachievable by tokenized transformers, while exhibiting reduced rote memorization (Belouadi et al., 2022).
AI-native wireless: End-to-end trained pilot- and cyclic-prefix-free transceivers, operating with no fixed constellation, outperform state-of-the-art model-based, pilot-dependent OFDM in bit error rate (BER), improve throughput by 26.4%, and adapt rapidly to dynamic channel conditions via minimal parameter updates (Cheng et al., 29 Oct 2025).
Multimodal LLMs: Models employing continuous embedding streams (instead of quantized tokens) enable full-duplex conversational AI with lower turn-taking latency, improved ASR, streaming TTS, and natural integration of multiple modalities (Yu et al., 27 Nov 2024).
5. Integration, Interpretability, and System Design
An essential property of codec-free end-to-end models is their capacity for tight integration with downstream vision, language, or communication tasks. For “compression for machines,” codec–task joint fine-tuning yields up to +7% absolute mAP over standard codecs for detection at low rates, and selective adaptation (encoder/decoder/task) delivers a flexible trade-off between performance and compatibility (Chamain et al., 2020). Learned frequency-oriented transforms produce interpretable latent splits aligning with human-perceptual and semantic saliency, enabling scalable and selective transmission (Zhang et al., 16 Jan 2024).
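The codec–task joint fine-tuning described above can be sketched as a combined objective; λ and μ are assumed trade-off weights, and the function and helper names are hypothetical, not the cited work's API:

```python
import torch
import torch.nn.functional as F

def codec_task_objective(x, x_hat, rate, task_logits, labels,
                         lam: float = 0.01, mu: float = 1.0):
    """Joint 'compression for machines' loss: L = D + lam*R + mu*L_task (sketch).

    The task term (here cross-entropy for a classification/detection head)
    backpropagates through the decoded image into the codec, so bits are
    allocated to regions the task network actually uses.
    """
    distortion = F.mse_loss(x_hat, x)
    task_loss = F.cross_entropy(task_logits, labels)
    return distortion + lam * rate + mu * task_loss

def freeze(module: torch.nn.Module) -> None:
    """Selective adaptation: freezing the encoder keeps the bitstream
    compatible with deployed encoders while decoder and task head specialize."""
    for p in module.parameters():
        p.requires_grad_(False)
```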
By eliminating modular codebooks and tokenization, these models avoid cumulative error from hard boundaries and support direct, differentiable optimization for application-specific objectives, such as barge-in handling and echo cancellation in full-duplex LLMs (Yu et al., 27 Nov 2024). At the same time, their continuous or soft-quantized representations can increase compute, memory, or inference latency versus hand-designed pipelines or highly compact codecs (Belouadi et al., 2022, Zhang et al., 16 Jan 2024).
6. Challenges, Limitations, and Future Directions
Codec-free end-to-end approaches introduce both opportunities and open technical challenges. The lack of discrete bottlenecks increases memory bandwidth and runtime for long sequences (e.g., the O(n²) self-attention cost in byte-level LMs and neural codecs), necessitating ongoing work on factorized or sparse attention and neural pooling (Belouadi et al., 2022, Zhang et al., 16 Jan 2024). In wireless, real-time deployment requires careful PAPR control and low-overhead online adaptation modules (Cheng et al., 29 Oct 2025).
Interpretability, while improved in frequency-oriented image models, is generally limited in learned, deep bottleneck representations outside of explicitly structured decompositions (Zhang et al., 16 Jan 2024). Standardization, cross-domain generalization, and task-agnostic compression pose genuine difficulties.
Research trends include: scalable learning of more structured or sparse representations; extension to arbitrary-length and arbitrary-form tasks (variable-stanza poetry, high-resolution video); full one-step distillation of diffusion codecs; multi-task, cross-modality integration; and hybrid architectures balancing end-to-end optimization with the practicalities of deployment, efficiency, and robustness (Jia et al., 24 Nov 2025, Belouadi et al., 2022, Yu et al., 27 Nov 2024).