
SkyReels-V4: Unified Multi-Modal Generation

Updated 27 February 2026
  • SkyReels-V4 is a unified multi-modal model that generates high-resolution video and synchronized audio using advanced diffusion techniques.
  • It introduces a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture that employs joint self-attention and cross-attention to ensure temporal and semantic alignment.
  • The model’s efficient pipeline leverages low-resolution previews, keyframe-based refinement, and sparse attention to deliver cinematic outputs with flexible editing capabilities.

SkyReels-V4 is a unified multi-modal video foundation model designed for joint video-audio generation, inpainting, and editing at cinematic resolutions and durations. The model introduces a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, with one branch dedicated to synthesizing video and the other to generating temporally aligned audio. These streams are unified by a shared text encoder based on a frozen Multimodal LLM (MMLM), enabling the flexible conditioning of generation on rich multi-modal instructions including text, images, video clips, audio, and masks. SkyReels-V4 supports a spectrum of video and audio tasks, handling resolutions up to 1080p, frame rates of 32 FPS, and sequences of up to 15 seconds, while also achieving computational efficiency through a joint low-resolution and keyframe-based generation strategy, subsequently refined by super-resolution and interpolation modules (Chen et al., 25 Feb 2026).

1. Dual-Stream Multimodal Diffusion Transformer Architecture

The core architectural innovation in SkyReels-V4 is the dual-stream MMDiT, featuring two parallel Transformer-based diffusion networks of identical embedding dimensions: one for video latents and one for audio latents. The video branch is initialized from a pretrained DiT-style text-to-video model, while the audio branch is trained from scratch with an analogous architecture. At each of the initial M layers, the model maintains separate QKV projections and layer normalization/MLP parameters for the video or audio stream ($x_v$ or $x_a$) and the shared text stream $x_t$, but performs joint self-attention across both:

$$\begin{aligned}
Q_v, K_v, V_v &= \mathrm{QKV}_v(\mathrm{LN}_v(x_v)), \\
Q_t, K_t, V_t &= \mathrm{QKV}_t(\mathrm{LN}_t(x_t)), \\
[x_v'; x_t'] &= \mathrm{SelfAttn}([Q_v; Q_t],\ [K_v; K_t],\ [V_v; V_t]).
\end{aligned}$$
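A minimal PyTorch sketch of this joint-attention pattern is given below; the module layout, single-head attention, and residual placement are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSelfAttentionBlock(nn.Module):
    """Sketch of one dual-stream block: stream-specific LayerNorm and QKV
    projections, joint self-attention over concatenated video + text tokens.
    Single-head attention and the residuals are simplifying assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.ln_v = nn.LayerNorm(dim)         # video-stream LayerNorm
        self.ln_t = nn.LayerNorm(dim)         # text-stream LayerNorm
        self.qkv_v = nn.Linear(dim, 3 * dim)  # video-stream QKV projection
        self.qkv_t = nn.Linear(dim, 3 * dim)  # text-stream QKV projection

    def forward(self, x_v: torch.Tensor, x_t: torch.Tensor):
        # Per-stream projections, matching the equations above.
        q_v, k_v, v_v = self.qkv_v(self.ln_v(x_v)).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_t(self.ln_t(x_t)).chunk(3, dim=-1)
        # Joint self-attention across the concatenated token sequences.
        q = torch.cat([q_v, q_t], dim=1)
        k = torch.cat([k_v, k_t], dim=1)
        v = torch.cat([v_v, v_t], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        out_v, out_t = out.split([x_v.size(1), x_t.size(1)], dim=1)
        return x_v + out_v, x_t + out_t       # residual connections

# Example: 128 video tokens and 32 text tokens sharing one attention grid.
blk = JointSelfAttentionBlock(dim=64)
x_v, x_t = blk(torch.randn(2, 128, 64), torch.randn(2, 32, 64))
```

The same pattern applies to the audio stream $x_a$ in place of $x_v$.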

After these initial joint-attention blocks, tokens are concatenated and processed through shared-parameter single-stream blocks for computational efficiency. To prevent semantic drift, each block incorporates reinforced text cross-attention:

$$x_v'' = x_v' + \mathrm{CrossAttn}(Q = x_v',\ K = x_t,\ V = x_t).$$

Bidirectional cross-attention is applied so that the audio stream attends to video features and the video stream then attends to the updated audio features, facilitating tight temporal and semantic alignment:

$$\begin{aligned}
a_i' &= a_i + \mathrm{CrossAttn}(Q = a_i,\ K = v_i,\ V = v_i), \\
v_i'' &= v_i' + \mathrm{CrossAttn}(Q = v_i',\ K = a_i',\ V = a_i').
\end{aligned}$$
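The following sketch mirrors this update order (audio queries video first, then video queries the updated audio), together with the reinforced text cross-attention from the previous equation; the `CrossAttn` helper and the head count are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttn(nn.Module):
    """Cross-attention helper: queries from `x`, keys/values from `ctx`."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=x, key=ctx, value=ctx)
        return out

class BidirectionalAVBlock(nn.Module):
    """Reinforced text cross-attention plus bidirectional audio-video
    cross-attention, applied in the order of the equations above."""
    def __init__(self, dim: int):
        super().__init__()
        self.v_from_t = CrossAttn(dim)  # video attends to text (anti-drift)
        self.a_from_v = CrossAttn(dim)  # audio attends to video features
        self.v_from_a = CrossAttn(dim)  # video attends to *updated* audio

    def forward(self, v_i, a_i, x_t):
        v_i = v_i + self.v_from_t(v_i, x_t)  # x_v'' = x_v' + CrossAttn(x_v', x_t)
        a_i = a_i + self.a_from_v(a_i, v_i)  # a_i'  = a_i  + CrossAttn(a_i, v_i)
        v_i = v_i + self.v_from_a(v_i, a_i)  # v_i'' = v_i' + CrossAttn(v_i', a_i')
        return v_i, a_i
```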

For temporal synchronization, both video and audio latents use 3D Rotary Positional Embeddings (RoPE), scaling the audio frequency such that temporally corresponding tokens share a unified attention grid. The shared MMLM text encoder processes multi-modal prompts—consisting of text, image, video, mask, and audio samples—whose embeddings are injected into self- and cross-attention layers throughout both branches, ensuring coherent conditioning across all modalities.
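One way to realize such a unified grid is to express token positions in seconds, so that streams with different token rates receive matching rotary phases at the same instant; the sketch below assumes a 32 FPS video latent and a hypothetical 50 tokens/s audio latent (the actual audio token rate is not reproduced here).

```python
import torch

def temporal_rope_phases(num_tokens: int, tokens_per_second: float,
                         dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary phase angles on a shared time axis: positions are measured in
    seconds, so differently sampled streams align at equal timestamps."""
    t = torch.arange(num_tokens) / tokens_per_second      # position in seconds
    inv_freq = base ** (-torch.arange(0, dim, 2) / dim)   # per-channel frequency
    return torch.outer(t, inv_freq)                       # (num_tokens, dim/2)

# 480 video frames at 32 FPS; 750 audio tokens at an assumed 50 tokens/s.
video_phase = temporal_rope_phases(480, 32.0, dim=64)
audio_phase = temporal_rope_phases(750, 50.0, dim=64)

# Tokens covering the same instant share the same rotary angle, e.g. t = 1 s:
assert torch.allclose(video_phase[32], audio_phase[50])
```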

2. Diffusion and Flow-Matching Sequence Modeling

SkyReels-V4 frames video and audio generation as diffusion-based sequence forecasting, adopting a continuous-time flow-matching objective. The forward diffusion process applies Gaussian noise, parameterized as follows for latent $z_0$ and noise $\varepsilon \sim \mathcal{N}(0, I)$ at diffusion time $t \in [0, 1]$:

$$z_t = t\,\varepsilon + (1 - t)\,z_0,$$

so that $z_0$ is the clean latent and $z_1 = \varepsilon$ is pure noise.

The model predicts the velocity field $v_\theta(z_t, t)$, approximating the target $\varepsilon - z_0$, by optimizing the mean-squared error loss:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{z_0,\,\varepsilon,\,t}\big[\, \| v_\theta(z_t, t) - (\varepsilon - z_0) \|_2^2 \,\big].$$

This formulation is applied separately to each modality but with explicit cross-modal conditioning:

$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}}^{(v)} + \mathcal{L}_{\mathrm{FM}}^{(a)},$$

where each branch's velocity prediction is conditioned on the other stream and on the shared text embeddings via cross-attention. Inference proceeds via numerical integration (an Euler solver) from $t = 1$ to $t = 0$ to denoise and recover the original latent $z_0$.
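A compact sketch of this objective and of Euler-solver inference, under the reconstruction above; `model(z_t, t, cond)` is a stand-in for either branch's velocity network, and the 50-step schedule is an assumption.

```python
import torch

def flow_matching_loss(model, z0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Flow-matching training step: z_t = t*eps + (1-t)*z0, target eps - z0."""
    b = z0.size(0)
    t = torch.rand(b, device=z0.device)              # continuous t ~ U[0, 1]
    t_ = t.view(b, *([1] * (z0.dim() - 1)))          # broadcast over latent dims
    eps = torch.randn_like(z0)
    z_t = t_ * eps + (1.0 - t_) * z0
    v_pred = model(z_t, t, cond)
    return ((v_pred - (eps - z0)) ** 2).mean()       # MSE against the velocity

@torch.no_grad()
def euler_sample(model, shape, cond, steps: int = 50, device: str = "cpu"):
    """Euler integration of dz/dt = v(z, t) from t = 1 (noise) to t = 0."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i * dt, device=device)
        z = z - dt * model(z, t, cond)               # step toward the clean latent
    return z
```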

3. Unified Channel-Concatenation Inpainting and Editing

SkyReels-V4 unifies a broad range of video generation, inpainting, extension, and editing tasks using a channel-concatenation approach to masked diffusion. The input tensor is defined as:

$$x_{\mathrm{in}} = [\, z_t;\ z_c;\ m \,],$$

concatenated along the channel dimension, where $z_t$ is the noisy video latent at timestep $t$, $z_c$ contains VAE-encoded conditional frames (with zeroed unconstrained regions), and $m$ is a binary spatio-temporal mask (1 to preserve, 0 to generate). Specific task configurations are enumerated below:

| Task | Mask $m$ specification | Description |
|------|------------------------|-------------|
| Text-to-Video (T2V) | $m = 0$ everywhere | Entire sequence generated |
| Image-to-Video (I2V) | $m = 1$ on the first frame, $m = 0$ elsewhere | Condition on the first frame |
| Video Extension | $m = 1$ on the last $k$ context frames | Extend with the last $k$ frames as context |
| Frame Interpolation | $m = 1$ on the given endpoint frames | Interpolate between given endpoints |
| Spatiotemporal Editing | Arbitrary $m$ | Free-form edit via mask specification |

This scheme provides a flexible, unified treatment of inpainting-style operations, including vision-referenced inpainting and editing driven by multi-modal prompts.
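The sketch below assembles the channel-concatenated input for two rows of the table above; latent shapes are illustrative, and `z_cond` stands in for VAE-encoded condition frames.

```python
import torch

def build_masked_input(z_t: torch.Tensor, z_cond: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """Channel-concatenation for masked diffusion. Shapes: latents are
    (B, C, T, H, W); the binary mask is (B, 1, T, H, W), 1 = preserve."""
    z_cond = z_cond * mask                        # zero unconstrained regions
    return torch.cat([z_t, z_cond, mask], dim=1)  # (B, 2C + 1, T, H, W)

B, C, T, H, W = 1, 16, 30, 34, 60                 # illustrative latent shape
z_t = torch.randn(B, C, T, H, W)                  # noisy video latent
z_cond = torch.randn(B, C, T, H, W)               # stand-in for encoded frames

# Text-to-Video: mask of zeros, the entire sequence is generated.
m_t2v = torch.zeros(B, 1, T, H, W)

# Image-to-Video: preserve the first latent frame, generate the rest.
m_i2v = torch.zeros(B, 1, T, H, W)
m_i2v[:, :, 0] = 1.0

x_in = build_masked_input(z_t, z_cond, m_i2v)
```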

4. Efficient High-Resolution Generation Pipeline

To address the computational challenges of 1080p, 32 FPS, and 15 s video-audio generation, SkyReels-V4 implements a staged pipeline:

  1. Joint Low-Resolution and Keyframe Prediction: The model generates a full-sequence low-resolution video (e.g., 270p/480 frames) and predicts a sparse set of high-resolution keyframes (e.g., every 8th frame at 1080p).
  2. Refiner for Super-Resolution and Interpolation: A dedicated Refiner module assembles the final video by upsampling the low-resolution latents to the high-resolution grid, inserting the predicted high-resolution keyframes, and concatenating fresh noise for a second DiT pass (see the assembly sketch at the end of this section). The Refiner simultaneously handles video super-resolution (VSR) and frame interpolation via a unified Transformer.
  3. Video Sparse Attention (VSA): Each Refiner block adopts a two-stage attention mechanism: pooled attention identifies the top-K spatio-temporal cubes, then dense attention runs within those cubes, reducing computational cost by approximately 3× with minimal quality degradation (see the sketch after this list).
  4. Training Loss: The Refiner is trained end-to-end with the same flow-matching loss, extended to jointly optimize high-resolution frames and keyframe interpolation objectives.
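Below is a simplified reference version of the two-stage mechanism in item 3: cube-pooled scores select the top-K cubes per query, then dense attention is restricted to those cubes via a mask. A production kernel would fuse the stages and never materialize the full mask; the cube size and K here are assumptions.

```python
import torch
import torch.nn.functional as F

def video_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           cube: int = 64, topk: int = 8) -> torch.Tensor:
    """Two-stage sparse attention: (1) pooled attention over per-cube key
    means scores each spatio-temporal cube; (2) dense attention runs only
    inside each query's top-K cubes, via a boolean mask. Shapes: (B, L, D)
    with L divisible by `cube`."""
    B, L, D = k.shape
    n_cubes = L // cube
    # Stage 1: mean-pool keys within each cube and score cubes per query.
    k_pool = k.view(B, n_cubes, cube, D).mean(dim=2)       # (B, n_cubes, D)
    scores = q @ k_pool.transpose(1, 2) / D ** 0.5         # (B, Lq, n_cubes)
    keep = scores.topk(topk, dim=-1).indices               # (B, Lq, topk)
    # Stage 2: dense attention restricted to the selected cubes.
    allowed = torch.zeros(B, q.shape[1], n_cubes, dtype=torch.bool,
                          device=q.device)
    allowed.scatter_(-1, keep, True)
    mask = allowed.repeat_interleave(cube, dim=-1)         # (B, Lq, L)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Example: 1024 tokens in 16 cubes of 64; each query attends to 8 cubes.
q = k = v = torch.randn(2, 1024, 32)
out = video_sparse_attention(q, k, v)
```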

This pipeline enables high-fidelity, long-duration outputs with strong temporal consistency across shots and synchronized audio, supporting multi-shot, cinema-level generation (Chen et al., 25 Feb 2026).
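As referenced in item 2 above, a sketch of the Refiner input assembly follows; the trilinear upsampling mode and the keyframe stride of 8 (matching the "every 8th frame" example) are assumptions, and T is taken as divisible by the stride.

```python
import torch
import torch.nn.functional as F

def assemble_refiner_input(z_low: torch.Tensor, keyframes: torch.Tensor,
                           key_stride: int = 8) -> torch.Tensor:
    """z_low: (B, C, T, h, w) full-sequence low-res latent.
    keyframes: (B, C, T // key_stride, H, W) predicted high-res keyframes.
    Returns the (B, 2C, T, H, W) input for the second DiT pass."""
    T = z_low.shape[2]
    H, W = keyframes.shape[-2:]
    # Upsample the low-res latent onto the high-res spatio-temporal grid.
    z_hi = F.interpolate(z_low, size=(T, H, W), mode="trilinear",
                         align_corners=False)
    # Overwrite the latent at keyframe positions with high-res predictions.
    z_hi[:, :, ::key_stride] = keyframes
    # Concatenate fresh noise so the Refiner denoises toward the final video.
    return torch.cat([z_hi, torch.randn_like(z_hi)], dim=1)

# Example: 64-frame low-res latent refined with 8 high-res keyframes.
out = assemble_refiner_input(torch.randn(1, 16, 64, 34, 60),
                             torch.randn(1, 16, 8, 135, 240))
```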

5. Training Regimen and Hyperparameters

SkyReels-V4 is trained through a multi-phase schedule leveraging massive scale across both visual and audio data:

  • Video Pretraining: Six stages over 11 epochs, starting from T2I at 256px on 3B images (3 epochs), progressing to T2V on 1B images + 400M videos, then expanding to I2V, V2V, and Edit (5% each), mixing resolutions up to 1080px, and finally incorporating image and video references in prompts (~20%).
  • Audio Pretraining: Training from scratch on hundreds of thousands of hours of speech, music, and SFX up to 15 s, for 3 epochs.
  • Joint Video-Audio Training: Includes T2V, T2A, and T2AV tasks, using a mixture of video and audio data at higher resolutions and durations (5–15 s).
  • Supervised Fine-Tuning: Conducted on 5M multi-modal conditioned videos (3 epochs), with a final stage on 1M curated high-quality videos.

Optimization employs AdamW with 0.05 weight decay and a learning rate that decays linearly from its peak, with batch sizes on the order of 2048 across 256 GPUs. Diffusion timesteps are sampled uniformly as continuous $t \in [0, 1]$.
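A minimal configuration sketch consistent with the stated recipe; the peak learning rate and the step count are placeholders, not values from the paper.

```python
import torch

model = torch.nn.Linear(64, 64)  # stand-in for the DiT parameters

# AdamW with 0.05 weight decay; the peak LR here is a placeholder value.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# Linear decay from the peak over a placeholder number of steps.
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=0.0, total_iters=100_000)

# Continuous diffusion timesteps sampled uniformly, one per batch element.
t = torch.rand(2048)
```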

6. Multi-Modal Instruction Following and In-Context Applications

SkyReels-V4 accepts and interprets highly structured, multi-modal prompts encompassing text, images, videos, audio, and masks. The MMLM-based text encoder extracts multi-modal embeddings, which are directly integrated throughout both the video and audio synthesis branches. Illustrative prompt types include:

  • Text + Images + Audio: Prompts resolving to synchronized, semantically rich video and audio (e.g., character identity, lip sync, structured dialogue, camera motion, background matching, and musical overlays).
  • Image-to-Video + Mask-Based Editing: Enables targeted removal or replacement (e.g., object substitution across all frames), with the audio branch regenerating congruent ambient effects.
  • Motion Transfer: Supports animation of static images using motion extracted from reference videos, aligning style and pose, and transferring environment and SFX.
  • First-Frame + Artistic Effect: Applies region-specific transformations over a video sequence, importing effects or styles from reference media via in-context attention.

All such tasks are mediated by the unified channel-concatenation and RoPE-based time-offset mechanisms, supporting extensive editing, extension, and compositional diversity. The architecture integrates cross-modal attention at all abstraction levels, supporting fine-grained control and cross-stream synchronization.


SkyReels-V4 advances foundation models for video and audio by providing a robust, extensible framework for unified generation, inpainting, and editing, supporting instructions and contextual guidance across an expansive range of modalities while sustaining computational tractability at cinematic output resolutions (Chen et al., 25 Feb 2026).

References

  • Chen et al., 25 Feb 2026.
