Codec-Free Streaming Integration

Updated 17 December 2025
  • Codec-free streaming integration is defined by replacing traditional quantization steps with continuous mappings, enabling real-time multimedia transmission without discretization artifacts.
  • It combines soft video delivery pipelines and continuous embedding streaming for speech, achieving lower latency, improved SNR performance, and graceful quality degradation.
  • The approach supports full-duplex communication with efficient echo cancellation and state control, facilitating robust adaptation in variable or low-SNR network environments.

Codec-free streaming integration encompasses a set of methodologies for real-time multimedia transmission that eschew traditional quantization and entropy-based codecs in favor of direct manipulation of continuous or analog-valued signal representations. These approaches have emerged as a response to the limitations of digital coding pipelines—most notably, catastrophic quality cliffs, quantization artifacts, and error propagation—in variable or low-SNR network environments. Integration strategies enable seamless streaming pipelines for both wireless video (Fujihashi et al., 2021) and full-duplex speech understanding-generation (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025), extending from physical layer encoding to high-level conversational AI systems.

1. Foundations and Theoretical Principles

Traditional digital streaming follows a “quantize → encode → channel-code → modulate” chain. Codec-free streaming replaces discrete coding and modulation steps with continuous mappings or pseudo-analog transmission. In soft-delivery systems for video, the flow is:

  • Linear transform (e.g., 3D-DCT, 2D-DWT) of the input signal, $X \mapsto s_k$,
  • Power-assigned scaling via $x_k = g_k \cdot s_k$, with $\sum_k g_k^2 \lambda_k = P$, where $\lambda_k = \mathbb{E}[|s_k|^2]$,
  • Analog (or pseudo-analog) modulation mapping $x_k$ directly to physical symbols.

The power allocation for subcarrier $k$ that minimizes end-to-end MSE in AWGN channels, subject to the total power constraint, is

$$g_k = \lambda_k^{-1/4} \sqrt{\frac{P}{\sum_{j=1}^N \lambda_j^{1/2}}}$$

and distortion evolves as a smooth function of SNR:

$$D(\mathrm{SNR}) = \sum_{k=1}^N \frac{\lambda_k}{1 + \beta_k\,\mathrm{SNR}}, \quad \beta_k = g_k^2 \lambda_k$$

yielding graceful degradation without “cliff-effects” or staircases (Fujihashi et al., 2021).
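As a concrete illustration, the NumPy sketch below implements this allocation and distortion formula; the chunk variances, power budget, and SNR sweep are illustrative assumptions, not values from Fujihashi et al. (2021).

```python
import numpy as np

def power_allocation(lam, P=1.0):
    """MSE-optimal scaling: g_k = lam_k^{-1/4} * sqrt(P / sum_j sqrt(lam_j))."""
    return lam ** -0.25 * np.sqrt(P / np.sum(np.sqrt(lam)))

def distortion(lam, g, snr):
    """Smooth distortion D(SNR) = sum_k lam_k / (1 + beta_k * SNR), beta_k = g_k^2 * lam_k."""
    beta = g ** 2 * lam
    return np.sum(lam / (1.0 + beta * snr))

lam = np.array([100.0, 25.0, 4.0, 1.0])       # illustrative chunk variances lambda_k
g = power_allocation(lam, P=1.0)
assert np.isclose(np.sum(g ** 2 * lam), 1.0)  # power constraint: sum_k g_k^2 lam_k = P

for snr_db in (0, 10, 20, 30):                # distortion decays smoothly, no cliff
    snr = 10 ** (snr_db / 10)
    print(f"{snr_db:2d} dB -> D = {distortion(lam, g, snr):.4f}")
```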

For speech and conversation, models such as SALMONN-omni eliminate discrete quantization by mapping the raw waveform directly to sequences of embeddings, processing these with an LLM, and generating output waveforms from continuous LLM embeddings. This process, subsuming recognition and synthesis within a single end-to-end differentiable system, is entirely codec-free (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025).
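The distinction reduces to a toy contrast: whether a nearest-codebook quantization step sits between the acoustic frontend and the LLM. In the sketch below, all shapes, the codebook, and the projection matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 80))         # 10 frames of acoustic features (toy)
W_in = rng.standard_normal((80, 512)) * 0.05   # hypothetical frontend projection
emb = frames @ W_in                            # continuous frame embeddings

# Codec-based path: snap each embedding to its nearest codebook entry,
# producing discrete token ids; information is lost at this step.
codebook = rng.standard_normal((1024, 512))
dists = ((emb[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)                  # shape (10,), ints in [0, 1024)

# Codec-free path: the continuous embeddings are streamed to the LLM as-is,
# so no discretization error is ever introduced.
llm_input = emb                                # shape (10, 512), floats
```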

2. System Architectures and Integration Patterns

Video: Soft Delivery Pipeline

The system comprises five logical stages (Fujihashi et al., 2021):

  1. Signal Transformation: A group of pictures (GoP) is transform-coded (e.g., 3D-DCT or DWT), yielding $N$ coefficients.
  2. Chunking and Metadata: Coefficients are partitioned into chunks/subbands with near-homogeneous variance; the chunk variances $\lambda_k$ are transmitted as protected metadata.
  3. Power Assignment: Each chunk is scaled using $g_k$ as above, under total power $P$.
  4. Pseudo-analog Modulation: Scaled symbols are mapped directly to float-valued transport payloads (e.g., RTP/UDP), with optional mixing for packet-loss resilience; only metadata is digitally encoded.
  5. Receiver Processing: Demodulate and compute the MMSE estimate of each coefficient,

$$\hat{s}_k = \frac{g_k \lambda_k}{g_k^2 \lambda_k + \sigma^2}\, y_k,$$

then invert the transforms to reconstruct video.
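A toy end-to-end simulation of stages 3–5 over an AWGN channel may help fix ideas; the chunk count, variances, power budget, and noise level below are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([100.0, 25.0, 4.0, 1.0])                    # per-chunk variances (metadata)
s = rng.standard_normal((4, 256)) * np.sqrt(lam)[:, None]  # transform coefficients per chunk
P, sigma2 = 1.0, 0.01                                      # power budget, noise variance

# Stage 3: power assignment g_k = lam_k^{-1/4} * sqrt(P / sum_j sqrt(lam_j))
g = lam ** -0.25 * np.sqrt(P / np.sum(np.sqrt(lam)))

# Stage 4: pseudo-analog transmission -- scaled floats sent directly over AWGN
y = g[:, None] * s + rng.standard_normal(s.shape) * np.sqrt(sigma2)

# Stage 5: per-coefficient MMSE estimate s_hat = g*lam / (g^2*lam + sigma^2) * y
s_hat = (g * lam / (g ** 2 * lam + sigma2))[:, None] * y

mse = np.mean((s_hat - s) ** 2, axis=1)
print("per-chunk MSE:", np.round(mse, 4))  # smooth, noise-dependent reconstruction error
```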

Speech: SALMONN-omni End-to-End (Editor’s term: “continuous embedding streaming architecture”)

Key stages (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025):

  1. Frontend: Raw waveform sampled at 16 kHz → 80 ms frames → 80-dimensional log-Mel features, downsampled and mapped by lightweight convolutional and MLP-based transformations into a $d$-dimensional continuous embedding space.
  2. LLM Core: Blocks of auditory embeddings are processed by the LLM, yielding either new embeddings for synthesis (in SPEAK mode) or “thinking”/state tokens (in LISTEN mode).
  3. Streaming Synthesis: In SPEAK mode, continuous LLM outputs are directly synthesized into waveforms; in LISTEN mode, embeddings propagate state and context for future decision making.

No quantization or codec vocabulary is present at any stage. Embedding sequences and state transitions are synchronized on fixed-length blocks (typically 80–100 ms).
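A minimal, block-synchronous frontend sketch in the same spirit appears below; the log-Mel extractor is a crude placeholder and the linear projection stands in for the convolutional/MLP mapping, so all dimensions and modules are assumptions rather than the actual SALMONN-omni frontend.

```python
import numpy as np

SR = 16_000
BLOCK_MS = 80
SAMPLES = SR * BLOCK_MS // 1000   # 1280 samples per 80 ms streaming block

def logmel_stub(block, n_mels=80, hop=160):
    """Crude stand-in for an STFT + Mel filterbank: one feature row per hop."""
    n_frames = len(block) // hop
    frames = np.abs(block[: n_frames * hop].reshape(n_frames, hop))
    return np.log1p(frames @ np.ones((hop, n_mels)) / hop)  # [n_frames, n_mels]

def frontend(block, W, stride=4):
    """Map one audio block to continuous embeddings: features -> downsample -> project."""
    feats = logmel_stub(block)
    feats = feats[::stride]        # temporal downsampling
    return feats @ W               # [n_frames/stride, d]; no quantization anywhere

rng = np.random.default_rng(2)
W = rng.standard_normal((80, 512)) * 0.05   # stand-in for the conv/MLP mapping, d = 512
block = rng.standard_normal(SAMPLES)        # one 80 ms block of raw audio
print(frontend(block, W).shape)             # (2, 512): block-synchronous embeddings for the LLM
```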

3. Synchronization and Full-Duplex Operation

“Thinking” and “Shift” Mechanisms

To manage transitions and maintain low-latency streaming, LLMs employ explicit state control via special tokens per block (Yu et al., 17 May 2025):

  • <think>: Retain the listening state for the current block.
  • <shift>: Toggle between listening (LISTEN) and speaking (SPEAK).

The gating decision is computed as

$$\ell_t = W_g H_t + b_g \in \mathbb{R}^2, \quad p_t = \mathrm{softmax}(\ell_t)$$

with the state updated as $s_t = \arg\max_{c \in \{\text{think}, \text{shift}\}} p_t(c)$.

The input to each Transformer block is extended with new environmental embeddings, echo information (if in the SPEAK state), and context tokens (Yu et al., 17 May 2025). Latency is determined by block duration plus per-step compute and synthesizer startup.

Echo Cancellation and Barge-in

Full-duplex models perform real-time echo cancellation by including both the raw (possibly echoing) environment embeddings and the assistant-generated embeddings as input to the LLM (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025). Via self-attention and masking, the LLM can learn to subtract or ignore echo, rendering explicit echo-suppression modules unnecessary. Barge-in detection uses similar mechanisms: shifts in the environmental embeddings trigger a transition to the LISTEN state in response to an interruption.

4. Comparative Performance and Deployment

Empirical comparisons between codec-free pipelines and quantized/codec-injected baselines (e.g., Moshi, SyncLLM) demonstrate:

  • Lower end-to-end latency (block size ≈100 ms for soft/embedding pipelines vs. ≈200–300 ms for codec-based systems) (Yu et al., 27 Nov 2024).
  • Substantial performance gains, e.g., a 30% relative improvement in full-duplex dialogue and turn-taking accuracy, large reductions in WER (down to ~2.4% on ASR test-clean), S2S QA accuracy ranging from 80% to 92%, and S2S GPTScore (AlpacaEval) up to 4.05/5.0 (Yu et al., 17 May 2025).
  • Higher SI-SDR for enhancement, more robust barge-in (F1 up to 0.95), and stronger echo cancellation (Yu et al., 27 Nov 2024).
  • Complete absence of compression artifacts such as cliff and staircase effects in video streaming; perceptual quality (e.g., PSNR) tracks SNR smoothly for all clients (Fujihashi et al., 2021).

Codec-free integration enables simultaneous streaming for all clients, efficient multicast, and graceful adaptation to fluctuating channel quality. In video, all users consume the same analog stream; detail recovery is proportional to individual SNR without explicit layers (Fujihashi et al., 2021).

5. Training, Fine-tuning, and Loss Functions

Codec-free LLM-based streaming systems are trained end-to-end with multi-term losses (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025):

$$L = \lambda_{\text{text}} L_{\text{text}} + \lambda_{\text{speech}} L_{\text{speech}} + \lambda_{\text{think}} L_{\text{think}}$$

For full-duplex speech:

  • $L_{\text{text}}$: cross-entropy over ASR/text targets.
  • $L_{\text{speech}}$: spectrogram (L1/L2 or MSE) losses for synthesis quality.
  • $L_{\text{think}}$: a negative reward for excessive placeholder usage.
  • Additional terms: $L_{\text{shift}}$ for correct state transitions, $L_{\text{bar}}$ for barge-in, and $L_{\text{echo}}$ for echo cancellation via embedding residuals.

Reinforcement learning, specifically Direct Preference Optimization (DPO), is applied post-SFT to calibrate turn-taking, yielding further improvements in full-duplex conversational metrics (Yu et al., 17 May 2025).
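A hedged PyTorch sketch of such a multi-term objective is given below; the loss weights, tensor shapes, and the reduction of $L_{\text{think}}$ to a simple gate cross-entropy are simplifying assumptions, not the exact training code of the cited papers.

```python
import torch
import torch.nn.functional as F

def multi_term_loss(text_logits, text_targets,   # [T, V] logits, [T] token ids
                    pred_spec, target_spec,      # [T, n_mels] spectrograms
                    gate_logits, gate_targets,   # [T, 2] logits, [T] in {0: think, 1: shift}
                    w_text=1.0, w_speech=1.0, w_think=0.1):
    """L = w_text * L_text + w_speech * L_speech + w_think * L_think (simplified)."""
    l_text = F.cross_entropy(text_logits, text_targets)   # ASR/text targets
    l_speech = F.l1_loss(pred_spec, target_spec)          # spectrogram synthesis loss
    # Gate supervision stands in for L_think/L_shift; the papers additionally
    # penalize excessive "think" placeholder usage.
    l_think = F.cross_entropy(gate_logits, gate_targets)
    return w_text * l_text + w_speech * l_speech + w_think * l_think

# Usage with dummy tensors
T, V, M = 20, 32_000, 80
loss = multi_term_loss(torch.randn(T, V), torch.randint(0, V, (T,)),
                       torch.randn(T, M), torch.randn(T, M),
                       torch.randn(T, 2), torch.randint(0, 2, (T,)))
print(float(loss))
```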
6. Practical Integration and Coexistence with Digital Systems

Codec-free modes can be retrofitted into existing streaming servers and transport infrastructure. For video streaming, only the digital quantizer-entropy and codec layers are bypassed; floating-point symbols are inserted into RTP/UDP packet payloads in place of H.264 or other bitstreams. Metadata such as chunk variances and packetization information is protected and transmitted over a digital control channel, ensuring backward compatibility (Fujihashi et al., 2021).

Fallback and hybrid strategies are supported: in low-SNR conditions, systems can switch to a digital codec or transmit a small digital base layer (hybrid digital-analog, HDA). Under bandwidth limitations in multicast, the lowest-energy (e.g., high-frequency) chunks can be dropped and signaled by a bitmap, preserving the principal content for lower-SNR users (Fujihashi et al., 2021).

In speech and conversational systems, the streaming inference pipeline employs block-synchronous chunking, explicit state transitions, and small buffers to bound latency and maintain synchronization. The absence of quantization and VAD modules leads to lower memory and compute overhead, with robustness to interruptions and context shifts (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025).

7. Significance, Limitations, and Future Directions

Codec-free streaming integration enables fundamentally different quality-versus-SNR trade-offs compared to traditional digital pipelines. These approaches are uniquely suited to heterogeneous multicast, variable channel environments, and full-duplex communication scenarios where rigid quantization or codec-based structures introduce unrecoverable artifacts or excessive latency.

Empirical results confirm superior full-duplex dialogue management, speech enhancement, and echo cancellation in practical deployments, with meaningful reductions in required training data compared to token-based baselines (Yu et al., 17 May 2025). However, practical challenges remain in minimizing metadata overhead, ensuring fairness in channel utilization, and improving synthesizer efficiency, particularly for large-scale deployments.

A plausible implication is that as end-to-end neural architectures mature and hardware support for continuous-valued streaming spreads, codec-free integration will become increasingly prevalent in both wireless multimedia and conversational AI platforms. The cited body of research forms the basis for further advances in joint source-channel coding, neural inference over continuous streams, and unified full-duplex streaming models (Fujihashi et al., 2021, Yu et al., 27 Nov 2024, Yu et al., 17 May 2025).
