FlashCodec: Multi-GPU Video Decoding

Updated 26 December 2025
  • FlashCodec is a collaborative multi-GPU video decoding library that eliminates the preprocessing bottleneck in multimodal LLM pipelines.
  • It partitions video decoding tasks across NVIDIA NVDEC engines using fine-grained, stall-free scheduling to maximize hardware utilization and minimize latency.
  • Quantitative evaluations demonstrate up to 9.1× speedup and 4.4× throughput improvement, markedly enhancing performance in video-heavy workloads.

FlashCodec is a collaborative multi-GPU video decoding library designed to eliminate the preprocessing bottleneck in multimodal LLM (MLLM) pipelines, particularly for video-heavy workloads. As multimodal systems integrate vision and language processing, the initial step of decoding video or image data (multimodal preprocessing) often dominates Time-to-First-Token (TTFT), substantially impacting both latency and throughput for MLLM serving. FlashCodec introduces a fine-grained, stall-free, multi-GPU architecture that partitions video decoding tasks across all available NVIDIA NVDEC engines, achieving sub-second latency for individual video decoding and significantly increasing system-wide throughput. It forms one component of an end-to-end serving stack jointly optimized with UnifiedServe to maximize resource utilization and eliminate inter-stage bottlenecks in vision-to-text inference (Zhao et al., 19 Dec 2025).

1. Role in the Multimodal LLM Pipeline

MLLM inference typically consists of three distinct stages:

  1. Multimodal Preprocessing: Decoding of input images or videos into patch tokens.
  2. Vision Encoding: Computation of visual embeddings via models such as ViT.
  3. LLM Inference: Prefill and decode stages over combined visual and text tokens.

In video-centric workloads, multimodal preprocessing—particularly CPU-bound H.264/H.265 decoding—dominates TTFT, taking up to several seconds for short 720p clips. Even with GPU acceleration, naive parallel assignment to individual NVDEC engines yields 500–1000 ms per video. Downstream, the vision encoder is compute-intensive and incompatible with co-batching alongside LLM decode, causing further fragmentation and blocking.

FlashCodec directly targets this bottleneck by enabling fine-grained, intra-video parallelization and GPU-wide collaboration, slashing TTFT and laying the foundation for improved overall system latency and utilization.

2. Collaborative Multi-GPU Video Decode Design

2.1 GOP-Based Task Partitioning

FlashCodec exploits the organizational unit of Groups of Pictures (GOPs) within compressed videos. By parsing container metadata to extract GOP boundaries $G = \{g_1, g_2, \dots, g_M\}$, FlashCodec assigns each $g \in G$ to a specific decoder engine slot $(i, e)$, where $i$ indicates the GPU and $e$ an NVDEC engine on that GPU. The GOP list is split into $\sum_{i=1}^{P} N_i$ segments $S_{i,e}$ (for $P$ GPUs, with $N_i$ NVDECs on GPU $i$), so that within a given segment GOP decoding is sequential, while across segments decoding occurs in parallel.
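As an illustration, here is a minimal Python sketch of this contiguous split; the function and argument names are assumptions for exposition, not FlashCodec internals:

```python
from typing import Dict, List, Tuple

def partition_gops(gops: List[int], nvdecs_per_gpu: List[int]) -> Dict[Tuple[int, int], List[int]]:
    """Split a GOP list G into one contiguous segment S[i,e] per (GPU, NVDEC) slot.

    `gops` holds GOP indices in bitstream order; `nvdecs_per_gpu[i]` is N_i.
    Decoding stays sequential within a segment and parallel across segments.
    """
    slots = [(i, e) for i, n in enumerate(nvdecs_per_gpu) for e in range(n)]
    per_slot = -(-len(gops) // len(slots))  # ceil(M / total slots)
    return {slot: gops[k * per_slot : (k + 1) * per_slot]
            for k, slot in enumerate(slots)}

# Example: 10 GOPs over 2 GPUs with 2 NVDECs each -> 4 segments of <= 3 GOPs.
print(partition_gops(list(range(10)), [2, 2]))
```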

2.2 Stall-Free Scheduling

NVDEC runtimes per GOP fluctuate considerably (up to 30–40 ms). Rather than batching an entire video onto one engine (which leaves faster engines idle), FlashCodec implements a stall-free scheduler: each GOP, or small group thereof ($G_s$), is enqueued as an independent mini-task. As soon as any decoder finishes, it is immediately assigned the next available $G_s$. This policy minimizes idle hardware, approaching maximal decoder utilization.
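One way to realize this policy is a shared work queue drained by one worker thread per decoder slot. The sketch below is illustrative only, with `decode_gop_group` standing in for the actual NVDEC call; it is not the paper's Algorithm 2:

```python
import queue
import threading

def stall_free_decode(gop_groups, slots, decode_gop_group):
    """Each (gpu, engine) slot pulls the next mini-task G_s the moment it goes idle."""
    tasks = queue.Queue()
    for idx, gs in enumerate(gop_groups):     # G_s mini-tasks in bitstream order
        tasks.put((idx, gs))
    results, lock = {}, threading.Lock()

    def worker(slot):
        while True:
            try:
                idx, gs = tasks.get_nowait()  # claim work immediately, no batching
            except queue.Empty:
                return                        # queue drained: worker exits
            frames = decode_gop_group(slot, gs)
            with lock:
                results[idx] = frames

    threads = [threading.Thread(target=worker, args=(s,)) for s in slots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]  # reassemble in bitstream order
```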

2.3 Memory Efficiency

Instead of pre-allocating large GPU buffers for every incoming video, FlashCodec postpones allocation until a rank's GOP decoding has finished. At that point, exactly the bytes needed for AVFrame-to-tensor conversion are allocated; combined with the fine-grained scheduling, this tightly bounds peak memory use, which is especially critical under asynchronous request load.
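A rough sketch of the deferred-allocation idea, assuming PyTorch tensors on the output side; frame shapes and names are illustrative:

```python
import torch

def gather_segment(decoded_frames, device="cuda:0"):
    # Defer GPU allocation until this segment's GOPs are fully decoded,
    # then size the output to exactly the bytes the frames require.
    n = len(decoded_frames)
    c, h, w = decoded_frames[0].shape            # assumed CxHxW uint8 frames
    out = torch.empty((n, c, h, w), dtype=torch.uint8, device=device)
    for k, frame in enumerate(decoded_frames):   # AVFrame-to-tensor conversion stand-in
        out[k].copy_(frame)
    return out
```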

3. GPU-Internal Scheduling Formalism

Given GOPs $G = \{g_1, \dots, g_M\}$, $P$ GPUs, and $N_i$ NVDEC engines on GPU $i$, the goal is to schedule each $g \in G$ uniquely to a slot $(i, e)$ so as to minimize the makespan $T_{\max}$, subject to:

  • $\bigcup_{i=1\ldots P,\; e=1\ldots N_i} S_{i,e} = G$
  • $S_{i,e} \cap S_{j,f} = \emptyset$ for all $(i, e) \ne (j, f)$
  • $\sum_{g \in S_{i,e}} t_\text{dec}(g) \le T_{\max}$ for all $i, e$

where $t_\text{dec}(g)$ is the (profiled) decode time of GOP $g$. In the ideal stall-free scheduling regime, the makespan empirically approaches

$T_\text{decode} \approx \max_{i,e} \sum_{g \in S_{i,e}} t_\text{dec}(g)$

with a lower bound

$T_\text{decode}^* \approx \frac{\sum_{g \in G} t_\text{dec}(g)}{\sum_i N_i}$
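Both quantities are straightforward to estimate from profiled per-GOP times. The sketch below uses a greedy least-loaded (LPT-style) assignment as a stand-in for the actual scheduler:

```python
import heapq

def makespan_estimate(t_dec, num_slots):
    """Greedy assignment: approximates max_{i,e} sum_{g in S_{i,e}} t_dec(g)."""
    # Min-heap of (accumulated time, slot); each GOP goes to the least-loaded slot.
    loads = [(0.0, s) for s in range(num_slots)]
    heapq.heapify(loads)
    for t in sorted(t_dec, reverse=True):
        load, s = heapq.heappop(loads)
        heapq.heappush(loads, (load + t, s))
    t_decode = max(load for load, _ in loads)  # empirical makespan T_decode
    t_lower = sum(t_dec) / num_slots           # ideal lower bound T_decode*
    return t_decode, t_lower

# e.g. profiled GOP times (ms) spread over 2 GPUs x 2 NVDECs = 4 slots
print(makespan_estimate([32, 40, 35, 30, 38, 33, 36, 31], 4))
```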

In experiments, stall-free scheduling delivers a $2.8\times$–$4.4\times$ speedup over single-GPU or CPU decoding, and $5.7\times$–$9.1\times$ on four A100 nodes (Zhao et al., 19 Dec 2025).

4. Integration with UnifiedServe and Resource Sharing

Post-decoding, FlashCodec transfers patch tokens to UnifiedServe’s shared staging buffers via IPC. UnifiedServe operates vision-encode, prefill, and decode workers as concurrent processes on the same GPU cluster, orchestrated under NVIDIA MPS. Each scheduling iteration $t$ is characterized by:

  • $E_t$: number of visual frames for vision encode (compute-bound)
  • $P_t$: number of tokens (text + visual) for prefill (compute-bound)
  • $D_t$: number of tokens for LLM decode (memory-bound)

Subject to compute capacity $C$ and memory bandwidth $B$ per GPU, per-iteration budgets $\alpha, \tau$ are set so that

$E_t \leq \alpha, \quad P_t \leq \tau, \quad E_t + P_t + D_t \leq C$

and

$M_\text{encode}(E_t) + M_\text{prefill}(P_t) + M_\text{decode}(D_t) \leq M_\text{GPU}$

Degradation factors $\delta_{X \to Y}$ are empirically measured for co-running phases, and budgets are set to bound worst-case slowdowns, specifically

$\delta_{\text{prefill} \to \text{decode}} \leq 1 + \epsilon, \quad \delta_{\text{encode} \to \text{decode}} \leq 1 + \epsilon'$

This ensures stable time-between-tokens (TBT) while maximizing hardware utilization. Throughput is approximated as

$\text{Throughput} \approx \frac{\sum_t (P_t + D_t)}{\sum_t T_t}$, where $T_t$ is the wall-clock time of iteration $t$,

yielding up to $4.4\times$ throughput improvement on long-video workloads relative to prior split or monolithic serving systems.
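Per iteration, the admission logic reduces to a handful of inequality checks. A minimal sketch follows, with the budget values and memory models as placeholders for quantities the system profiles offline:

```python
def admit_iteration(E_t, P_t, D_t, *, alpha, tau, C, M_gpu,
                    mem_encode, mem_prefill, mem_decode):
    """True iff this iteration's batch respects the compute and memory budgets."""
    compute_ok = E_t <= alpha and P_t <= tau and (E_t + P_t + D_t) <= C
    memory_ok = (mem_encode(E_t) + mem_prefill(P_t)
                 + mem_decode(D_t)) <= M_gpu
    return compute_ok and memory_ok

# alpha and tau are calibrated offline so measured co-running slowdowns stay
# within delta_{prefill->decode} <= 1 + eps and delta_{encode->decode} <= 1 + eps'.
```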

5. Implementation Aspects

FlashCodec comprises approximately 5.6 K lines of C++/CUDA, built as an extension of TorchCodec. Three primary Python APIs (all GIL-releasing) are exposed:

  • analyse_bitstream(S) → (key_id, metadata, #GPUs)
  • add_decoding_request(key_id)
  • get_decoding_output() → (key_id, GPU tensor)
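A plausible call sequence over these three APIs; only the API names come from the source, while the module name and surrounding details are assumptions:

```python
import flashcodec  # hypothetical module name; only the three calls below are documented

with open("clip_720p.mp4", "rb") as f:
    bitstream = f.read()

# Parse the bitstream once: returns a request handle, container metadata,
# and the number of GPUs the decode will be spread across.
key_id, metadata, num_gpus = flashcodec.analyse_bitstream(bitstream)

# Enqueue the decode; GOPs are fanned out across all NVDEC engines internally.
flashcodec.add_decoding_request(key_id)

# Block until a request completes; output arrives directly as a GPU tensor.
done_key, frames = flashcodec.get_decoding_output()
assert done_key == key_id
```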

Video decoding uses FFmpeg backends for H.264/H.265 and on-chip JPEG engines for images, falling back to the CPU for non-JPEG image formats. Stall-free scheduling and the “current-worker-prioritized” dispatcher run on a compact thread pool built from mutex/condition primitives, as detailed in the publication’s Algorithm 2. UnifiedServe adds roughly 2 K lines for IPC buffer management, a shared KV-cache, NCCL orchestration, and a chunked “Prefill–Encode” scheduler (Algorithm 4).

6. Quantitative Evaluation and System Impact

Benchmarking on a 4 × A100 configuration with Qwen2.5-VL-32B and InternVL3-38B models demonstrated:

  • Decode latency: 2.8–4.4× faster on a single GPU, 5.7–9.1× on four GPUs; e.g., 800 ms → 140 ms per 720p H.264/H.265 video.
  • End-to-end TTFT: up to 80% reduction versus split designs; 50% below chunked-prefill monolithic pipelines.
  • TBT tail: 83% lower $P_{99}$ latency than monolithic baselines, matching or outperforming existing split systems.
  • Throughput: up to 4.4× gain in tokens/second over state-of-the-art systems for MLVU long-video workloads.
  • SLO attainment: up to 3.0× more sustained requests at fixed TTFT/TBT, or 1.5× tighter SLOs at unchanged throughput.

The combination of hardware-parallelized, stall-free decoding and interference-aware resource partitioning enables sub-second TTFT on videos of several minutes' duration, with stable low-tail TBT and best-in-class throughput metrics for disaggregated multimodal LLM inference (Zhao et al., 19 Dec 2025).
