FlashCodec: Multi-GPU Video Decoding
- FlashCodec is a collaborative multi-GPU video decoding library that eliminates the preprocessing bottleneck in multimodal LLM pipelines.
- It partitions video decoding tasks across NVIDIA NVDEC engines using fine-grained, stall-free scheduling to maximize hardware utilization and minimize latency.
- Quantitative evaluations demonstrate up to 9.1× speedup and 4.4× throughput improvement, markedly enhancing performance in video-heavy workloads.
FlashCodec is a collaborative multi-GPU video decoding library designed to eliminate the preprocessing bottleneck in multimodal LLM (MLLM) pipelines, particularly for video-heavy workloads. As multimodal systems integrate vision and language processing, the initial step of decoding video or image data (multimodal preprocessing) often dominates Time-to-First-Token (TTFT), substantially impacting both latency and throughput for MLLM serving. FlashCodec introduces a fine-grained, stall-free, multi-GPU architecture that partitions video decoding tasks across all available NVIDIA NVDEC engines, achieving sub-second latency for individual video decoding and significantly increasing system-wide throughput. It forms one component of an end-to-end serving stack jointly optimized with UnifiedServe to maximize resource utilization and eliminate inter-stage bottlenecks in vision-to-text inference (Zhao et al., 19 Dec 2025).
1. Role in the Multimodal LLM Pipeline
MLLM inference typically consists of three distinct stages:
- Multimodal Preprocessing: Decoding of input images or videos into patch tokens.
- Vision Encoding: Computation of visual embeddings via models such as ViT.
- LLM Inference: Prefill and decode stages over combined visual and text tokens.
In video-centric workloads, multimodal preprocessing—particularly CPU-bound H.264/H.265 decoding—dominates TTFT, taking up to several seconds for short 720p clips. Even with GPU acceleration, naive parallel assignment to individual NVDEC engines yields 500–1000 ms per video. Downstream, the vision encoder is compute-intensive and incompatible with co-batching alongside LLM decode, causing further fragmentation and blocking.
FlashCodec directly targets this bottleneck by enabling fine-grained, intra-video parallelization and GPU-wide collaboration, slashing TTFT and laying the foundation for improved overall system latency and utilization.
2. Collaborative Multi-GPU Video Decode Design
2.1 GOP-Based Task Partitioning
FlashCodec exploits the organizational unit of Groups of Pictures (GOPs) within compressed videos. By parsing container metadata to extract the GOP boundaries $g_1, \dots, g_K$, FlashCodec assigns each GOP to a specific decoder engine slot $(i, j)$, where $i$ indicates the GPU and $j$ an NVDEC engine on that GPU. The GOP list is split into $M \times D$ segments (for $M$ GPUs each with $D$ NVDECs), so that within a given segment, GOP decoding is sequential, while across segments, decoding occurs in parallel.
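A minimal sketch of this contiguous split, assuming the GOP list has already been parsed from container metadata (the function name and its arguments are illustrative, not FlashCodec's internal API):

```python
# Illustrative GOP-to-slot partitioning (hypothetical helper, not FlashCodec's API).
# gops: GOP descriptors parsed from container metadata, in bitstream order.
# num_gpus (M) and nvdecs_per_gpu (D) define the M x D decoder engine slots.
def partition_gops(gops, num_gpus, nvdecs_per_gpu):
    """Split the GOP list into M*D contiguous segments, one per (gpu, nvdec) slot."""
    num_slots = num_gpus * nvdecs_per_gpu
    base, extra = divmod(len(gops), num_slots)
    segments, start = {}, 0
    for slot in range(num_slots):
        length = base + (1 if slot < extra else 0)
        gpu, engine = divmod(slot, nvdecs_per_gpu)
        # GOPs inside a segment are decoded sequentially by one engine;
        # segments themselves are decoded in parallel across engines.
        segments[(gpu, engine)] = gops[start:start + length]
        start += length
    return segments

# Example: 10 GOPs on 2 GPUs x 2 NVDECs -> segments of size 3, 3, 2, 2.
print(partition_gops(list(range(10)), num_gpus=2, nvdecs_per_gpu=2))
```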
2.2 Stall-Free Scheduling
NVDEC runtimes per GOP fluctuate considerably (up to 30–40 ms). Rather than batching an entire video to one engine (which causes idle time for faster engines), FlashCodec implements a stall-free scheduler: each GOP $g_k$ (or a small group thereof) is enqueued as an independent mini-task. As soon as any decoder finishes, it is immediately assigned the next available mini-task. This policy minimizes idle hardware, approaching maximal decoder utilization.
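The following is a hedged sketch of such a stall-free dispatcher in Python; the actual implementation is a C++ thread pool with mutex/condition primitives (Algorithm 2 in the paper), and decode_gop plus the slot identifiers here are placeholders:

```python
# Illustrative stall-free dispatch: each (gpu, nvdec) slot pulls the next GOP as
# soon as it finishes the previous one, so fast engines never sit idle.
import queue
import threading

def stall_free_decode(gops, slots, decode_gop):
    """slots: list of (gpu, nvdec) ids; decode_gop(slot, gop) performs one decode."""
    work = queue.Queue()
    for gop in gops:                      # one mini-task per GOP (or small GOP group)
        work.put(gop)

    def worker(slot):
        while True:
            try:
                gop = work.get_nowait()   # grab the next available mini-task
            except queue.Empty:
                return                    # no work left: this engine is done
            decode_gop(slot, gop)
            work.task_done()

    threads = [threading.Thread(target=worker, args=(s,)) for s in slots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```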
2.3 Memory Efficiency
Instead of pre-allocating large GPU buffers for every incoming video, FlashCodec postpones allocation until a rank's GOP decoding has finished. At that point, exactly the bytes needed for AVFrame-to-tensor conversion are allocated, which, combined with the fine-grained scheduling, tightly bounds peak memory use, a property that is especially critical under asynchronous request load.
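A hedged sketch of this deferred allocation, assuming decoded frames arrive as NumPy arrays and the output is a PyTorch GPU tensor (the helper and its types are illustrative, not FlashCodec's code):

```python
# Hedged sketch of deferred output allocation (illustrative only).
import numpy as np
import torch

def frames_to_tensor(decoded_frames, device="cuda:0"):
    """Allocate the output tensor only after all frames of a rank are decoded."""
    n = len(decoded_frames)
    h, w, c = decoded_frames[0].shape
    # Exactly n*h*w*c bytes are allocated, and only once decoding has finished,
    # so peak GPU memory tracks completed work rather than queued requests.
    out = torch.empty((n, h, w, c), dtype=torch.uint8, device=device)
    for i, frame in enumerate(decoded_frames):
        out[i].copy_(torch.from_numpy(np.ascontiguousarray(frame)))
    return out
```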
3. GPU-Internal Scheduling Formalism
Given GOPs $g_1, \dots, g_K$, $M$ GPUs, and $D$ NVDEC engines per GPU, the goal is to assign each GOP uniquely to a slot $(i, j)$, forming per-slot sets $S_{i,j}$, so as to minimize the makespan

$$T_{\max} = \max_{i,j} \sum_{g_k \in S_{i,j}} t(g_k),$$

subject to:
- every GOP is decoded, i.e. $\bigcup_{i,j} S_{i,j} = \{g_1, \dots, g_K\}$
- $S_{i,j} \cap S_{i',j'} = \emptyset$ for all $(i,j) \neq (i',j')$, so no GOP is decoded twice
where $t(g_k)$ is the (profiled) decode time of GOP $g_k$. In the ideal stall-free scheduling regime, the makespan empirically approaches

$$T_{\max} \approx \frac{1}{MD} \sum_{k=1}^{K} t(g_k) + \max_k t(g_k),$$

with a lower bound

$$T_{\max} \geq \frac{1}{MD} \sum_{k=1}^{K} t(g_k).$$
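A toy numerical check of these bounds, using invented per-GOP decode times rather than measured data:

```python
# Toy check of the makespan bounds above (invented numbers, not measurements).
t = [34, 28, 40, 31, 36, 25, 38, 30]         # profiled decode times t(g_k) in ms
M, D = 2, 2                                  # 2 GPUs x 2 NVDEC engines = 4 slots
lower_bound = sum(t) / (M * D)               # 262 / 4 = 65.5 ms
stall_free_estimate = lower_bound + max(t)   # 65.5 + 40 = 105.5 ms
print(lower_bound, stall_free_estimate)
```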
In experiments, this delivers 2.8–4.4× speedup over baseline single-GPU or CPU decoding, and 5.7–9.1× on four A100 nodes (Zhao et al., 19 Dec 2025).
4. Integration with UnifiedServe and Resource Sharing
Post-decoding, FlashCodec transfers patch tokens to UnifiedServe’s shared staging buffers using IPC. UnifiedServe operates vision encode, prefill, and decode workers as concurrent processes on the same GPU cluster, orchestrated under NVIDIA MPS. Each scheduling iteration is characterized by:
- $n_e$: number of visual frames for vision encode (compute-bound)
- $n_p$: number of tokens (text + visual) for prefill (compute-bound)
- $n_d$: number of tokens for LLM decode (memory-bound)
Subject to the compute capacity and memory bandwidth of each GPU, the per-iteration budgets $(n_e, n_p, n_d)$ are set so that the combined compute demand of vision encode and prefill stays within the GPU's compute capacity while decode's memory traffic stays within its bandwidth. Degradation factors for co-running phases are empirically measured, and budgets are chosen to bound worst-case slowdowns, keeping each decode iteration within the time-between-tokens (TBT) target even when it co-runs with encode and prefill. This ensures stable TBT while maximizing hardware utilization. Throughput is approximated as the number of tokens emitted per scheduling iteration divided by the iteration time, yielding up to 4.4× throughput improvement on long-video workloads relative to prior split or monolithic serving systems.
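As a concrete illustration of the TBT-bounding budget, the sketch below derives a per-iteration decode token budget from a measured slowdown factor; the function name, per-token cost, and slowdown value are assumptions, not values or APIs from the paper:

```python
# Hedged illustration of interference-aware decode budgeting (assumed model).
def decode_token_budget(tbt_slo_ms, per_token_decode_ms, co_run_slowdown):
    """Largest per-iteration decode batch that still meets the TBT target
    when decode co-runs with vision encode and prefill on the same GPU."""
    worst_case_ms_per_token = per_token_decode_ms * co_run_slowdown
    return int(tbt_slo_ms // worst_case_ms_per_token)

# Example: 100 ms TBT SLO, 0.9 ms/token alone, 1.3x measured slowdown -> 85 tokens.
print(decode_token_budget(tbt_slo_ms=100, per_token_decode_ms=0.9, co_run_slowdown=1.3))
```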
5. Implementation Aspects
FlashCodec comprises approximately 5.6 K lines of C++/CUDA, built as an extension of TorchCodec. Three primary Python APIs (all GIL-releasing) are exposed:
- analyse_bitstream(S) → (key_id, metadata, #GPUs)
- add_decoding_request(key_id)
- get_decoding_output() → (key_id, GPU tensor)
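A hedged end-to-end usage sketch of these three calls; the module name flashcodec, the byte-string input, and the blocking behavior of get_decoding_output() are assumptions, not confirmed by the paper:

```python
# Hedged usage sketch; module name and exact argument/return types are assumed.
import flashcodec  # hypothetical import of the FlashCodec extension

with open("clip_720p.h264", "rb") as f:
    bitstream = f.read()

# 1) Parse container metadata (GOP boundaries, codec info) and register the video.
key_id, metadata, num_gpus = flashcodec.analyse_bitstream(bitstream)

# 2) Enqueue the decode; GOP mini-tasks are dispatched across all NVDEC engines.
flashcodec.add_decoding_request(key_id)

# 3) Retrieve the decoded frames as a GPU-resident tensor (assumed to block).
out_key, frames = flashcodec.get_decoding_output()
assert out_key == key_id
print(frames.shape, frames.device)
```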
Video decoding leverages FFmpeg backends for H.264/H.265, on-chip JPEG engines for images, and falls back to CPU for non-JPEG images. Stall-free scheduling and the “current-worker-prioritized” dispatcher utilize a compact thread pool with mutex/condition primitives, as detailed in the publication’s Algorithm 2. UnifiedServe brings an additional ~2 K lines for buffer management (IPC), a shared KV-cache, NCCL orchestration, and a chunked “Prefill–Encode” scheduler (Algorithm 4).
6. Quantitative Evaluation and System Impact
Benchmarking on a 4 × A100 configuration with Qwen2.5-VL-32B and InternVL3-38B models demonstrated:
- Decode latency: 2.8–4.4× faster on a single GPU, 5.7–9.1× on four, e.g., 800 ms → 140 ms per 720p H.264/H.265 video.
- End-to-end TTFT: up to 80% reduction versus split designs; 50% below chunked-prefill monolithic pipelines.
- TBT Tail: 83% lower latency than monolithic baselines, matching or outperforming existing split systems.
- Throughput: up to 4.4× gain in tokens/second over state-of-the-art systems for MLVU long-video workloads.
- SLO Attainment: Up to 3.0× more sustained requests at fixed TTFT/TBT, or 1.5× tighter SLOs at unchanged throughput.
The combination of hardware-parallelized, stall-free decoding and interference-aware resource partitioning enables sub-second TTFT on videos of several minutes' duration, with stable low-tail TBT and best-in-class throughput metrics for disaggregated multimodal LLM inference (Zhao et al., 19 Dec 2025).