ScheDFR: Dynamic Frame Rate Adaptation
- ScheDFR is a dynamic frame rate adaptation algorithm that optimizes neural speech codecs by merging acoustically similar frames.
- It employs an unsupervised, dynamic programming-based segmentation strategy to allocate tokens based on content complexity.
- The method supports adaptive bitrate control and outperforms fixed-frame-rate codecs in efficiency, quality, and cross-domain performance.
ScheDFR (Schedulable Dynamic Frame Rate) is a dynamic frame rate adaptation algorithm introduced in the context of neural speech codecs, specifically as a core component of the CodecSlime (ElastiCodec) system. It addresses the inefficiency inherent in fixed-frame-rate (FFR) neural speech codecs by enabling unsupervised, content-aware temporal redundancy compression. ScheDFR operates during inference to dynamically adjust the number of encoded frames (tokens) per time interval in response to the actual information density of the input speech, optimizing both the bitrate and reconstruction quality without relying on explicit supervision or architectural changes.
1. Dynamic Frame Rate Adaptation Mechanism
ScheDFR is designed to merge contiguous, acoustically similar frames in the encoded latent space, thereby reducing redundancy in steady-state speech regions (e.g., silences, long vowels) while preserving high temporal resolution in information-rich regions. The key steps are as follows:
- Input: A sequence of continuous encoder outputs $\mathbf{z} = (z_1, \dots, z_T)$ and a target downsampling ratio $r$.
- Segmentation Goal: Partition the sequence into $N = T/r$ contiguous segments $(S_1, \dots, S_N)$, with $l_i$ as the length of the $i$-th segment and $\sum_{i=1}^{N} l_i = T$. Segment lengths are constrained by $1 \le l_i \le l_{\max}$, where $l_{\max}$ is the maximum segment length.
- Downsampling Rule: Each segment is averaged:
$$\bar{z}_i = \frac{1}{l_i} \sum_{t=b_i}^{b_i + l_i - 1} z_t,$$
where $b_i = 1 + \sum_{j<i} l_j$, $i = 1, \dots, N$.
- Optimal Scheduling: Segmentation is optimized to minimize intra-segment feature dispersion, quantified by a surrogate score that rewards similarity between each frame and its segment mean:
$$\mathcal{S}(l_1, \dots, l_N) = \sum_{i=1}^{N} \sum_{t=b_i}^{b_i + l_i - 1} \mathrm{sim}(z_t, \bar{z}_i).$$
The aim is to maximize $\mathcal{S}$ by dynamic programming (DP), as the problem has optimal substructure and manageable complexity on the order of $O(T \cdot N \cdot l_{\max})$.
This segment-averaged downsampling ensures that frame tokens are densely allocated to acoustically complex regions and sparsely to redundant regions, all based on encoder features without explicit speech or linguistic supervision.
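The scheduling and merging steps above can be sketched as a small dynamic program. This is a minimal illustration, not the paper's implementation: the cosine-similarity surrogate, the `l_max` default, and the tie-breaking behavior are assumptions consistent with the description above.

```python
import numpy as np

def schedule_and_merge(z, num_segments, l_max=4):
    """Partition frames z[0..T-1] into `num_segments` contiguous segments
    (each at most l_max frames long), maximizing the summed cosine
    similarity between frames and their segment mean, then average
    each segment into a single merged frame."""
    T = len(z)

    def seg_score(s, e):
        # Summed cosine similarity of frames z[s:e] to their mean.
        mean = z[s:e].mean(axis=0)
        denom = np.linalg.norm(z[s:e], axis=1) * np.linalg.norm(mean) + 1e-8
        return float((z[s:e] @ mean / denom).sum())

    NEG = -1e18
    # dp[i][t]: best score for splitting the first t frames into i segments.
    dp = np.full((num_segments + 1, T + 1), NEG)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    dp[0][0] = 0.0
    for i in range(1, num_segments + 1):
        for t in range(i, T + 1):
            for l in range(1, min(l_max, t) + 1):
                if dp[i - 1][t - l] == NEG:
                    continue
                cand = dp[i - 1][t - l] + seg_score(t - l, t)
                if cand > dp[i][t]:
                    dp[i][t], back[i][t] = cand, l
    # Recover segment lengths by backtracking, then merge by averaging.
    lengths, t = [], T
    for i in range(num_segments, 0, -1):
        l = int(back[i][t])
        lengths.append(l)
        t -= l
    lengths.reverse()
    merged, s = [], 0
    for l in lengths:
        merged.append(z[s:s + l].mean(axis=0))
        s += l
    return np.stack(merged), lengths
```

On a toy input where the first four frames are identical and the rest are mutually orthogonal, the DP spends one long segment on the steady region and singleton segments on the changing frames, mirroring the token-allocation behavior described above.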
2. Comparative Efficiency and Performance Advancements
ScheDFR fundamentally improves the bitrate efficiency and quality trade-offs over traditional FFR codecs:
- Token Allocation: Token spend matches acoustic variability—silences and prolonged vowels are compressed with fewer tokens, while rapid transitions receive more.
- Empirical Metrics:
- When operating at 40 Hz (600 bps), a ScheDFR-based model achieves a Word Error Rate (WER) of 3.82%, compared to 5.59% for a standard FFR baseline, an approximately 32% relative reduction in WER at the same bitrate.
- STOI, PESQ, SECS, and UTMOS scores are maintained or improved relative to FFR baselines.
- Even accounting for necessary duration bits, WER remains 8% lower for ScheDFR at the same total bitrate.
- Generality: A single ScheDFR-equipped model outperforms separately trained FFR baselines across multiple frame rates (40–80 Hz) and exhibits lower WER on unseen multilingual test sets, indicating robust domain transfer capabilities.
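The relationship between frame rate, token width, and bitrate behind these operating points is simple arithmetic; the 15-bit token width below is inferred from 600 bps at 40 Hz, not stated explicitly in the text.

```python
def bitrate_bps(frame_rate_hz, bits_per_token):
    """Nominal bitrate of a token stream: tokens per second times bits per token."""
    return frame_rate_hz * bits_per_token

# 40 Hz at 15 bits/token yields the 600 bps operating point quoted above;
# doubling the frame rate to 80 Hz doubles the bitrate.
assert bitrate_bps(40, 15) == 600
assert bitrate_bps(80, 15) == 1200
```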
3. Integration into Neural Codec Architectures
ScheDFR is implemented as an inference-time module in the CodecSlime (ElastiCodec) paradigm, leveraging a VQ-GAN-inspired backbone:
- Backbone Overview: The architecture comprises a CNN + LSTM encoder, Finite Scalar Quantizer (FSQ), mirrored LSTM + CNN decoder, and adversarial discriminator.
- Algorithm Placement: ScheDFR operates after the encoder LSTM and before the quantizer. It receives the full encoder feature sequence, determines the optimal segmentation, merges frames, and outputs a reduced-length feature sequence for quantization.
- Quantization Synergy: The use of FSQ is particularly compatible with ScheDFR, enabling smooth transitions and high codebook utilization even at reduced frame rates.
No retraining of the underlying codec is required for different inference-time frame rates, and the DP search for segmentations is optimized for real-time throughput.
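The module placement and quantizer interaction can be sketched as follows. The `levels=5` grid and the tanh bounding follow common FSQ practice but are assumptions, not details from the text, and real FSQ training uses a straight-through gradient estimator omitted here.

```python
import numpy as np

def fsq_quantize(z, levels=5):
    """Finite Scalar Quantization sketch: bound each dimension to
    [-1, 1] via tanh, then round to `levels` uniformly spaced values."""
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half) / half

def codec_forward(z_enc, schedule):
    """ScheDFR sits between the encoder output and the quantizer:
    merge frames according to `schedule` (a list of segment lengths),
    then quantize the shorter merged sequence."""
    merged, s = [], 0
    for l in schedule:
        merged.append(z_enc[s:s + l].mean(axis=0))
        s += l
    return fsq_quantize(np.stack(merged))
```

Because FSQ quantizes each dimension independently against a fixed grid, averaged (merged) features remain in-distribution for the quantizer, which is the compatibility point noted above.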
4. Rate-Quality Trade-offs and Flexibility
ScheDFR exposes bitrate-quality trade-offs not possible with conventional FFR codecs:
- User-Controlled Bitrate: Users can specify target frame rates (and thus bitrates) at inference, trading quality against bandwidth as needed.
- Single Model, Multiple Rates: Unlike FFR codecs, which require separate models for each frame rate, ScheDFR supports a single model across a range of operating points.
- Content-Timing Decoupling: The model explicitly encodes frame durations, enabling fine-grained timing control independent of information content.
- Application Contexts: This flexibility is well-suited for streaming, resource-constrained devices, and ultra-low-latency scenarios.
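One way to account for the explicit duration coding implied by content-timing decoupling is to attach a small fixed-width duration code to each merged token. The `l_max`-based code below is an assumption for illustration; the actual side-information format may differ.

```python
import math

def stream_bits(num_tokens, bits_per_token, l_max):
    """Total bits for a token stream where each merged token carries its
    content code plus a duration code wide enough to cover 1..l_max frames."""
    duration_bits = math.ceil(math.log2(l_max))
    return num_tokens * (bits_per_token + duration_bits)
```

With `l_max = 4`, each token needs only 2 extra duration bits, so 40 tokens/s of 15-bit content cost 680 bps in total rather than 600, a small overhead consistent with the modest duration-bit penalty reported above.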
5. Innovations and Contributions within CodecSlime
The primary technical advances embodied in ScheDFR are:
- Unsupervised, Architecture-Agnostic DFR: ScheDFR is the first plugin-style algorithm endowing learned neural codecs with dynamic frame rate abilities, without external alignments or architectural entanglement.
- Efficient Surrogate-Driven Scheduling: Using encoder feature similarity as a surrogate, highly correlated with downstream quality, enables effective segmentation via DP, bypassing the need for unstable RL or exhaustive search.
- Melt-and-Cool Training Paradigm: In conjunction with "Melt-and-Cool" (training-time robustness to random merges and fine-tuning with optimal schedules), this approach regularizes the model for merged-frame decoding.
- Minimal Overhead: Integration incurs negligible computational cost and operates orthogonally to the codec backbone, making it applicable across diverse codecs.
Within CodecSlime, ScheDFR is the engine of temporal redundancy reduction, delivering its core efficiency and perceptual gains.
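A minimal sketch of the "melt" phase's random-merge augmentation: sample random contiguous segment lengths and decode from the averaged frames, so the model becomes robust to merged inputs before fine-tuning with optimal schedules. The uniform length distribution is an assumption; the actual training distribution may differ.

```python
import random
import numpy as np

def random_merge_schedule(T, l_max=4, seed=0):
    """Sample random contiguous segment lengths covering all T frames,
    each between 1 and l_max (the last segment is clipped to fit)."""
    rng = random.Random(seed)
    lengths = []
    while sum(lengths) < T:
        lengths.append(min(rng.randint(1, l_max), T - sum(lengths)))
    return lengths

def apply_merge(z, lengths):
    """Average each sampled segment, as in inference-time ScheDFR."""
    out, s = [], 0
    for l in lengths:
        out.append(z[s:s + l].mean(axis=0))
        s += l
    return np.stack(out)
```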
6. Practical Implications and Deployment Considerations
ScheDFR enables neural codecs to combine efficient audio compression with strong reconstruction quality:
- Near-Limit Bitrates: Allows operation near theoretical speech compression lower bounds (50–100 bps) while maintaining intelligibility and fidelity.
- Generalizability: Maintains performance across domain and language shifts.
- Streaming Support: Remains effective under chunk-wise streaming with only marginal quality loss for reasonable segment sizes.
- Downstream Task Benefits: Outputs provide improved intelligibility, beneficial for ASR, TTS, and S2ST applications.
- Deployment: Supports adaptive rate control in production, allowing real-time, on-the-fly adjustment without retraining or model switching.
Summary Table: Core ScheDFR Formulas
Purpose | Formula |
---|---|
Frame Averaging (Downsample) | $\bar{z}_i = \frac{1}{l_i} \sum_{t=b_i}^{b_i + l_i - 1} z_t$ |
Optimal Scheduling Objective | $\max_{l_1, \dots, l_N} \mathcal{S}(l_1, \dots, l_N)$ subject to $\sum_{i=1}^{N} l_i = T$, $1 \le l_i \le l_{\max}$ |
Surrogate Metric Definition | $\mathcal{S} = \sum_{i=1}^{N} \sum_{t=b_i}^{b_i + l_i - 1} \mathrm{sim}(z_t, \bar{z}_i)$, with $b_i = 1 + \sum_{j<i} l_j$ |
ScheDFR constitutes a principled and practical approach to adaptive temporal compression for neural speech codecs, providing state-of-the-art efficiency and audio quality within a lightweight, broadly applicable inference-time scheduling framework.