MUG-V 10B: 10B-Param Latent Video Model

Updated 4 July 2026

MUG-V 10B is a 10-billion-parameter latent video generation model that uses a Video VAE and DiT backbone to compress and generate video latents.
It integrates a multi-stage data processing pipeline with scene detection, multi-level filtering, and high-quality captioning to enhance training efficiency.
The model employs a staged curriculum with multimodal conditioning and Megatron-Core optimizations, achieving near-linear scaling on 500 H100 GPUs.

MUG-V 10B is a 10-billion-parameter latent video generation model built around a Diffusion Transformer operating in a highly compressed video latent space, together with a training framework organized around four pillars: data processing, model architecture, training strategy, and infrastructure for large-scale video generation (Zhang et al., 20 Oct 2025). It is presented as a system for general video generation and, in particular, e-commerce-oriented generation, and it supports text-to-video, image-to-video, and text-plus-image-to-video conditioning. The work is notable not only for the model itself but also for open-sourcing model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement (Zhang et al., 20 Oct 2025).

1. Identity and scope

MUG-V 10B is defined in the source paper as a 10B latent video generation model based on a Video VAE and a large DiT backbone (Zhang et al., 20 Oct 2025). The system follows the latent video diffusion or latent flow transformer paradigm: a Video VAE compresses input video into a latent tensor, and a DiT learns to generate or denoise those latents under text conditioning and, when applicable, image or frame conditioning.

The paper emphasizes that the contribution is “full-stack” rather than narrowly architectural. Its stated target is high-efficiency training for large video generation models, with optimization spanning data preparation, compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. This places MUG-V 10B within the contemporary lineage of large latent video generators while distinguishing it by the explicit coupling of model design to systems engineering.

A common source of confusion is the acronym “MUG.” In arXiv usage, “MUG” also refers to unrelated work on interactive multimodal UI grounding (Li et al., 2022), multi-human 3D mesh reconstruction (Wu et al., 2022), and meeting understanding and generation benchmarks (Zhang et al., 2023). MUG-V 10B is not an extension of those projects; it is a distinct video generation system (Zhang et al., 20 Oct 2025).

2. Data processing and video preparation

The data pipeline is one of the paper’s central technical components. Raw videos are aggregated from public and internal sources, then filtered through video-level screening for licensing, privacy compliance, prohibited content, and diversity of scenes, subjects, and motion (Zhang et al., 20 Oct 2025). Videos are then split into clips using PySceneDetect together with CSS (Color-Struct SVM) from Koala-36M, with the stated aim of handling both sharp scene cuts and gradual transitions such as fades.

The clip filtering stack is multi-stage. Sharpness is evaluated by Laplacian variance, retaining frames whose score lies in the interval $[200, 2000]$; aesthetic quality is scored with a LAION-style predictor, rejecting clips that score below $4.5$; motion amplitude is estimated by RAFT optical flow, rejecting clips whose average flow magnitude is below $1$ or above $20$; and a proprietary multimodal LLM filter fine-tuned on 24k labeled videos is used to detect overlays, borders, special effects, speed alteration, and camera shake (Zhang et al., 20 Oct 2025). These steps are presented as mechanisms for improving prompt adherence, convergence, and generalization.
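The numeric thresholds above amount to a conjunction of per-clip checks. The following is a minimal pure-Python sketch of that logic, not the paper's implementation: the `ClipStats` fields and function names are hypothetical, the aesthetic and optical-flow scores are assumed to be computed upstream, and the 4-neighbour Laplacian is one common way to obtain a Laplacian variance. The proprietary multimodal LLM filter is not modeled.

```python
from dataclasses import dataclass
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of equal-length rows of numbers (hypothetical input
    format); the image must be at least 3x3.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


@dataclass
class ClipStats:
    sharpness: float  # Laplacian variance of the clip's frames
    aesthetic: float  # score from an aesthetic predictor
    mean_flow: float  # average optical-flow magnitude


def passes_filters(clip):
    """Apply the three numeric thresholds quoted in the text."""
    sharp_ok = 200 <= clip.sharpness <= 2000
    aesthetic_ok = clip.aesthetic >= 4.5
    motion_ok = 1 <= clip.mean_flow <= 20
    return sharp_ok and aesthetic_ok and motion_ok
```

A clip with sharpness 500, aesthetic score 5.0, and mean flow 3.0 passes all three checks, while lowering the aesthetic score to 4.0 causes rejection.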

Caption generation is likewise treated as infrastructure rather than annotation alone. The paper states that Qwen2-VL-72B is fine-tuned as a high-quality video captioner and then distilled into Qwen2-VL-7B for throughput, with captions designed to cover objects, appearance, motion, and background context. An LLM then parses captions into ontology-like tags for subjects, actions, and scenes; these tags are used for stratified sampling and near-duplicate detection. This suggests a training regime in which semantic metadata actively shapes both corpus balancing and data efficiency.

For later-stage post-training, the paper describes constructing a smaller human-labeled subset. It retains approximately the top $10\%$ of the pretraining set by composite quality score, rebalances toward human-centric clips, and then manually reviews clips for motion continuity, content stability, and visual fidelity. Around $0.3$ million high-quality clips are selected for supervised finetuning (Zhang et al., 20 Oct 2025).

3. Latent representation and model architecture

The architectural core consists of a high-compression Video VAE and a 10B DiT backbone (Zhang et al., 20 Oct 2025). The encoder maps a clip to a latent tensor

$Z \in \mathbb{R}^{(T/8) \times (H/8) \times (W/8) \times C},$

with bottleneck width $C = 24$. This corresponds to $8\times$ compression in time, height, and width, or $512\times$ volumetric reduction before tokenization. The DiT then applies non-overlapping $2 \times 2$ patchification in latent space, giving an effective total compression of approximately $2048\times$ relative to pixel space (Zhang et al., 20 Oct 2025).
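The compression arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, the clip size in the usage note is illustrative, and the default patch size of `(1, 2, 2)` (no temporal patching, $2 \times 2$ spatial patching) is an assumption made for the example.

```python
def latent_shape(frames, height, width, channels=24, stride=8):
    """Latent tensor shape (T/8, H/8, W/8, C) for a clip of the given size."""
    return (frames // stride, height // stride, width // stride, channels)


def tokens_after_patchify(latent, patch=(1, 2, 2)):
    """Number of DiT tokens after non-overlapping patchification.

    The patch size is an assumed parameter (time, height, width).
    """
    t, h, w, _ = latent
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


def pixels_per_token(stride=8, patch=(1, 2, 2)):
    """Pixel-space positions covered by one token."""
    pt, ph, pw = patch
    return (stride ** 3) * pt * ph * pw
```

Under these assumptions, a hypothetical 120-frame clip at 720 by 1280 pixels maps to a latent of shape `(15, 90, 160, 24)` and 54,000 tokens, with each token covering 2,048 pixel-space positions.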

The Video VAE is initialized from an SDXL image VAE and extended to video through a unified hybrid convolutional architecture. A distinctive element is the “minimal encoding principle”: each latent token is encoded independently from its own local 8-frame chunk rather than through a temporally causal encoder with unequal context lengths across frames. The paper motivates this by arguing that the VAE should prioritize compression and reconstruction rather than generative temporal modeling, and that causal temporal convolutions create temporal information imbalance (Zhang et al., 20 Oct 2025).

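The VAE objective is described as a weighted sum of reconstruction, KL regularization, and adversarial terms,

$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}},$

where $\mathcal{L}_{\text{rec}}$ combines MSE, $L_1$, and perceptual loss, and $\mathcal{L}_{\text{adv}}$ is used only in final finetuning to sharpen texture and color (Zhang et al., 20 Oct 2025). After stabilization, the reconstruction term is augmented with an adaptive saliency weighting that upweights dynamic high-frequency regions.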

The generator is a 10B DiT with 56 transformer blocks. Rather than using an MM-DiT-style block, the paper adopts a block structure aligned with autoregressive LLMs: self-attention, then cross-attention, then FFN (Zhang et al., 20 Oct 2025). Visual tokens use 3D RoPE for spatiotemporal positional encoding. Text conditioning is applied through cross-attention, and the model additionally embeds scalar global conditioning signals such as diffusion timestep and frame rate through a shared MLP.

A further architectural choice concerns image or frame conditioning. Instead of simply concatenating condition latents, MUG-V 10B uses a masked latent strategy: conditioned regions are replaced with the provided frame or image latent, their diffusion timestep is set to zero, and the remaining tokens follow the ordinary noisy trajectory (Zhang et al., 20 Oct 2025). This conditioning scheme underlies support for image-to-video and text-plus-image-to-video generation and is also described as compatible with video continuation and first-, middle-, or last-frame conditioning.
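The masked latent strategy can be illustrated with a small sketch. It assumes, for simplicity, that conditioning is applied at the granularity of whole latent frames and that each frame carries its own timestep; the function name and the dictionary-based interface are hypothetical and not taken from the released code.

```python
def apply_frame_conditioning(noisy_latents, timestep, cond_latents):
    """Replace conditioned frames with clean latents and zero their timestep.

    noisy_latents: list of per-frame latents on the noisy trajectory
    timestep:      diffusion timestep shared by all unconditioned frames
    cond_latents:  dict mapping latent-frame index -> clean latent
    Returns (latents, per_frame_timesteps) without mutating the inputs.
    """
    latents = list(noisy_latents)
    timesteps = [timestep] * len(latents)
    for index, clean in cond_latents.items():
        latents[index] = clean   # conditioned region uses the given latent
        timesteps[index] = 0.0   # and is marked as noise-free
    return latents, timesteps
```

First-frame conditioning corresponds to `cond_latents={0: image_latent}`, last-frame conditioning to an entry at the final index, and video continuation to entries for a leading block of frames.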

4. Training strategy and parameter scaling

The training recipe is explicitly staged. The paper first trains a smaller DiT of approximately 2B parameters, with the same target depth of 56 blocks and hidden size 1728, to validate recipes and hyperparameters before full-scale expansion (Zhang et al., 20 Oct 2025). It then expands to 10B parameters through function-preserving hidden-size expansion. For a linear layer $y = Wx$ with $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, the paper gives an expanded weight matrix $W'$ that acts on replicated inputs $x'$, so that the outputs $y'$ are approximately replicated copies of $y$ and the widened network starts from the function computed by the smaller one. Because the parameter count of a linear layer scales with the product of its input and output widths, a hidden-size expansion factor of $k$ grows the parameter count by about $k^2$ (Zhang et al., 20 Oct 2025).
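To make the idea of function-preserving width expansion concrete, the sketch below shows one standard construction: tile the weight matrix into a $k \times k$ grid of blocks and scale by $1/k$. This particular tiling is an assumption chosen for illustration and may differ from the paper's exact formula; all function names are hypothetical.

```python
def matvec(weight, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def expand_linear(weight, k=2):
    """Tile `weight` into a k x k block grid and scale every entry by 1/k."""
    wide_rows = [[w / k for w in row] * k for row in weight]
    return [list(row) for _ in range(k) for row in wide_rows]


def replicate(x, k=2):
    """Stack k copies of a vector."""
    return list(x) * k


def param_count(weight):
    return len(weight) * len(weight[0])
```

With replicated inputs, each block row of the expanded matrix sums $k$ copies of $Wx / k$, so `matvec(expand_linear(W), replicate(x))` equals `replicate(matvec(W, x))`, while `param_count` grows by a factor of $k^2$.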

Pretraining then proceeds through a three-stage curriculum. Stage 1 mixes image data with 360p video clips and anneals the image-to-video ratio so that video gradually dominates. Stage 2 remains at 360p but increases clip length from 2s to 5s. Stage 3 moves to 5s clips at 720p, curated from around 12M high-quality videos (Zhang et al., 20 Oct 2025). The paper notes that, because of their higher throughput, stages 1 and 2 account for considerably more samples seen than stage 3.

After pretraining plateaus, the paper applies annealed supervised finetuning on around $0.3$ million manually selected high-quality clips, using gradually decaying learning rate and post-EMA rather than standard online EMA. It then applies preference optimization. Two forms are emphasized: KTO for absolute good or bad labels targeting physical and rendering failures, and DPO for pairwise motion-quality preferences (Zhang et al., 20 Oct 2025). The appendix additionally mentions RDPO. The retained supervised objective is described as a regularizer against drift such as exaggerated motion amplitude or recurring texture patterns.

The paper states that the DiT is trained with flow matching objectives, but it does not give the exact flow-matching loss formula. It also omits many low-level optimization details, including optimizer choice, learning rate values, batch size, training steps, total FLOPs or GPU-hours, and precision mode. This suggests partial reproducibility at the systems level, but not full recipe disclosure in the paper text alone.
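Since the exact loss is not given, the sketch below shows only the widely used linear-interpolation (rectified-flow style) form of flow matching, as an assumption rather than the paper's formulation. The convention that $t = 0$ is clean data and $t = 1$ is pure noise is likewise assumed; it is chosen so that a timestep of zero denotes a noise-free latent. Latents are flattened to plain lists for brevity.

```python
def interpolate(data, noise, t):
    """Point on the straight path from data (t = 0) to noise (t = 1)."""
    return [(1 - t) * d + t * n for d, n in zip(data, noise)]


def velocity_target(data, noise):
    """Constant velocity of the straight path: noise minus data."""
    return [n - d for d, n in zip(data, noise)]


def flow_matching_loss(pred_velocity, data, noise):
    """Mean squared error between predicted and target velocity."""
    target = velocity_target(data, noise)
    squared = [(p - v) ** 2 for p, v in zip(pred_velocity, target)]
    return sum(squared) / len(squared)
```

In training, a model would receive `interpolate(data, noise, t)` together with `t` and the conditioning signals, and its output would be passed as `pred_velocity`.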

5. Infrastructure and training efficiency

Systems engineering is treated as a first-class contribution. The implementation is built on Megatron-Core and combines Data Parallelism, Tensor Parallelism, Pipeline Parallelism, and Sequence Parallelism (Zhang et al., 20 Oct 2025). The stated operational strategy is TP within a node, SP across the TP group to reduce activation memory, PP across nodes via point-to-point communication, and DP on top to scale effective batch size.
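The relationship between the parallelism degrees can be sketched as simple arithmetic. The helper below is hypothetical and assumes the usual Megatron-style layout in which sequence parallelism reuses the tensor-parallel group rather than adding a separate dimension, so the world size factors as TP times PP times DP. The GPU counts in the usage note are illustrative and are not the paper's configuration.

```python
def data_parallel_size(world_size, tp, pp):
    """Data-parallel degree implied by a given TP and PP configuration.

    Assumes world_size = tp * pp * dp, with sequence parallelism sharing
    the tensor-parallel group.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp * pp = {model_parallel}"
        )
    return world_size // model_parallel
```

For a hypothetical 64-GPU job with `tp=8` inside each node and `pp=2` across nodes, the remaining data-parallel degree is 4.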

The paper argues that video DiT training is unusually demanding because of long latent sequences, full attention, and large activation memory. To address this, MUG-V 10B uses a collection of low-level optimizations: activation recomputation is disabled; asynchronous I/O, aggressive prefetching, and caching are used to keep devices fed; and dynamic balanced sampling distributes variable-cost video batches so that different ranks receive comparably expensive work (Zhang et al., 20 Oct 2025). These measures are presented as reducing stragglers and pipeline bubbles.
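One simple way to realize balanced distribution of variable-cost samples is greedy longest-processing-time assignment: sort samples by estimated cost and repeatedly give the next most expensive one to the currently least loaded rank. The sketch below illustrates that heuristic only; it is an assumption, not the paper's sampler, and the cost values would in practice be a proxy such as token count.

```python
import heapq


def balance_across_ranks(costs, num_ranks):
    """Greedy assignment of samples to ranks by descending cost.

    costs:     per-sample cost estimates (hypothetical proxy, e.g. tokens)
    num_ranks: number of data-parallel ranks
    Returns a list with one list of sample indices per rank.
    """
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for index in order:
        load, rank = heapq.heappop(heap)  # least loaded rank so far
        assignment[rank].append(index)
        heapq.heappush(heap, (load + costs[index], rank))
    return assignment


def rank_loads(costs, assignment):
    """Total cost assigned to each rank."""
    return [sum(costs[i] for i in indices) for indices in assignment]
```

The heuristic does not guarantee an optimal split, but it keeps per-rank loads close, which is the property that limits stragglers.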

Kernel fusion is a major part of the efficiency story. The paper reports hand-written Triton kernels that fuse linear bias addition, per-pixel scale-and-shift modulation, and residual accumulation into a single kernel, and additional refactoring that enables LayerNorm plus QKV projection fusion, masked softmax folding into FlashAttention-2, and zero-padding removal through static shape inference (Zhang et al., 20 Oct 2025). The intended effect is to reduce global memory traffic and raise arithmetic intensity.

The headline systems claim is near-linear scaling on 500 NVIDIA H100 GPUs (Zhang et al., 20 Oct 2025). The paper also states that this is, to its knowledge, the first public release of large-scale video generation training code that exploits Megatron-Core for high training efficiency and near-linear multi-node scaling. A plausible implication is that the paper’s significance lies as much in open systems design as in end-task generation quality.

6. Evaluation, positioning, and limitations

On VBench-I2V, the paper reports a total score of 88.46 for MUG-V 10B (Zhang et al., 20 Oct 2025). The comparison table in the source places it above CogVideoX 5B at 86.70, STIV 8.7B at 86.73, HunyuanVideo 13B at 86.82, Wan2.1 14B at 86.86, Dynamic-I2V 5B at 88.45, and Step-Video 30B at 88.36, while below MAGI-1 24B at 89.28. The paper further states that at submission time MUG-V 10B ranked third on the VBench I2V leaderboard, behind MAGI-1 and a commercial system PI.

Its metric profile is uneven. The paper highlights the subject- and background-consistency metrics as best in the reported table, a Quality score of 81.55 as second-best shown, and camera motion as a relative weakness (Zhang et al., 20 Oct 2025). This suggests strong subject and background consistency, along with competitive overall quality, but weaker camera-motion alignment.

The paper also reports domain-specific human evaluation for e-commerce video generation against HunyuanVideo and Wan 2.1, using 5-second clips sampled from public showroom images and judged by three annotators with majority consensus (Zhang et al., 20 Oct 2025). The stated protocol asks whether a clip is discernibly AI-generated, whether it preserves product consistency relative to the input image, and whether it is deployable as “high quality” in professional or cinematographic terms. The source states that MUG-V 10B achieves superior pass rate and high-quality rate in these blind evaluations, although the exact numeric values are not reported here.

The paper is explicit that MUG-V 10B is competitive rather than categorically dominant. It says the model “matches recent state-of-the-art video generators overall” and surpasses leading open-source baselines in human evaluations on e-commerce-oriented tasks (Zhang et al., 20 Oct 2025). At the same time, several limitations are acknowledged: residual minor artifacts, geometric distortions, physical implausibilities, imperfect material and texture preservation, hand and articulated-region errors, and limited controllability or conditioning faithfulness. It also emphasizes that training still requires substantial resources, including up to 500 H100 GPUs, and that parts of the data pipeline rely on internal data and proprietary filtering components.

These limitations define the paper’s stated future directions: stronger faithfulness and controllability, better fine-grained appearance fidelity, scaling to longer durations and higher resolutions, improved long-range temporal consistency, and continued algorithmic and systems advances for efficient training and inference (Zhang et al., 20 Oct 2025). Within that framing, MUG-V 10B occupies a specific position in the literature: a competitive 10B latent video generator whose distinctive contribution is the public release of a large-scale Megatron-Core-based training stack, rather than a claim of a wholly new generative paradigm.