Waver: Unified High-Fidelity Video Generation

Updated 26 September 2025
  • The paper introduces a Hybrid Stream DiT architecture that combines dual-stream and single-stream transformer layers to optimize video-text alignment and computational efficiency.
  • Waver unifies text-to-video, image-to-video, and text-to-image tasks through a flexible conditioning mechanism and dual-encoder system, ensuring high motion fidelity and temporal consistency.
  • Empirical evaluations show that Waver outperforms open-source models on public leaderboards, while a cascade refinement strategy upscales native 720p synthesis to 1080p.

Waver is a high-performance foundation model for unified image and video generation, capable of directly synthesizing 5–10 second videos at native 720p and upscaling outputs to 1080p. It supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) tasks within a single integrated framework, and introduces architectural and data-curation innovations that yield superior motion fidelity and temporal consistency. Waver ranks among the top three models on prominent public leaderboards, consistently outperforming open-source solutions and matching state-of-the-art commercial systems (Zhang et al., 21 Aug 2025). The following sections detail the architectural foundations, unified generative capabilities, data pipeline and quality control, empirical performance, training and inference strategy, and broader impact of the Waver model.

1. Hybrid Stream Diffusion Transformer Architecture

Waver’s architecture centers on a unique “Hybrid Stream DiT” design combining dual-stream and single-stream transformer processing for modality alignment and efficiency. Early layers of the transformer operate in a dual-stream mode, with separate weights for high-dimensional video tokens and text tokens. This separation allows strong early modality-specific representation learning and facilitates precise alignment of video and textual semantics. As processing progresses, the architecture transitions to single-stream shared-parameter layers, leveraging joint spatial-temporal and semantic representations for computational efficiency and convergence stability.
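A minimal PyTorch sketch of this layout is given below. Layer counts, dimensions, and the joint-attention details are illustrative assumptions; only the overall pattern (modality-specific dual-stream blocks followed by shared single-stream blocks) reflects the description above.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Early block: joint attention over video + text tokens, but modality-specific
    norms and MLPs so each stream keeps its own parameters (sizes are illustrative)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vid, txt):
        x = torch.cat([self.norm_v(vid), self.norm_t(txt)], dim=1)
        out, _ = self.attn(x, x, x)                        # joint attention over both modalities
        vid = vid + self.mlp_v(out[:, : vid.size(1)])      # modality-specific MLP for video tokens
        txt = txt + self.mlp_t(out[:, vid.size(1):])       # modality-specific MLP for text tokens
        return vid, txt

class SingleStreamBlock(nn.Module):
    """Later block: shared weights applied to the fused video+text token sequence."""
    def __init__(self, dim, heads):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)

    def forward(self, x):
        return self.block(x)

class HybridStreamDiT(nn.Module):
    """Dual-stream layers first, then single-stream layers on the fused sequence."""
    def __init__(self, dim=1024, heads=16, n_dual=4, n_single=8):
        super().__init__()
        self.dual = nn.ModuleList(DualStreamBlock(dim, heads) for _ in range(n_dual))
        self.single = nn.ModuleList(SingleStreamBlock(dim, heads) for _ in range(n_single))

    def forward(self, vid_tokens, txt_tokens):
        for blk in self.dual:
            vid_tokens, txt_tokens = blk(vid_tokens, txt_tokens)
        x = torch.cat([vid_tokens, txt_tokens], dim=1)     # fuse modalities
        for blk in self.single:
            x = blk(x)
        return x[:, : vid_tokens.size(1)]                  # return the video-token outputs
```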

A key aspect is the hybrid positional encoding, which blends 3D rotary position embedding (RoPE) for relative spatial-temporal information with factorized learnable absolute position embeddings, accommodating both global and fine-grained alignment across modalities. To further enforce semantic consistency, Waver applies a representation alignment loss:

$$L_{\text{align}} = -\mathbb{E}_{x}\left[\frac{1}{N}\sum_{i}\frac{\mathbf{h}_{g,i} \cdot \mathbf{f}_i}{\|\mathbf{h}_{g,i}\| \, \|\mathbf{f}_i\|}\right]$$

where $\mathbf{h}_{g,i}$ is the projected DiT feature for token $i$, $\mathbf{f}_i$ is the corresponding semantic feature from a multimodal LLM, and $N$ is the number of tokens.
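This loss is a negative mean cosine similarity; a minimal sketch is shown below, where the (B, N, D) feature shapes and the projection that produces the DiT features are assumptions.

```python
import torch.nn.functional as F

def representation_alignment_loss(h_g, f):
    """L_align: negative mean cosine similarity between projected DiT features
    h_g (B, N, D) and multimodal-LLM semantic features f (B, N, D)."""
    cos = F.cosine_similarity(h_g, f, dim=-1)   # (B, N): per-token cosine similarity
    return -cos.mean()                          # average over tokens and batch, negated
```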

2. Unified Generative Framework: T2V, I2V, T2I Tasks

Unlike piecemeal or task-specific video synthesis approaches, Waver unifies three central video/image generation paradigms within a single model:

  • Text-to-Video (T2V): Generation is conditioned solely on a textual prompt, producing coherent video sequences with accurate semantic content and complex motion.
  • Image-to-Video (I2V): Beginning from arbitrary input images (encoded as VAE latents), the model extends these static frames into temporally consistent video clips, allowing rich motion generation while preserving input details.
  • Text-to-Image (T2I): The architecture can generate high-resolution still images from textual prompts as an auxiliary task, assisted by the same transformer backbone and loss scheduling.

This is enabled by a flexible conditioning mechanism: the model's input includes a noisy primary latent $V$, a conditional frames tensor $I$, and a binary mask $\text{Mask}$ indicating which frames are conditioned. Adjusting these conditioning channels allows seamless switching among generative tasks.
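A minimal sketch of this conditioning interface is shown below. The (B, C, T, H, W) latent layout and channel-wise concatenation are assumptions used for illustration: `cond_frames=None` corresponds to T2V, a first-frame latent to I2V, and a single-frame sequence to the T2I case.

```python
import torch

def build_dit_input(noisy_latent, cond_frames=None):
    """Assemble the DiT input from the noisy latent V, the conditional frames tensor I,
    and a binary mask marking which frames are conditioned (shapes are illustrative)."""
    B, C, T, H, W = noisy_latent.shape
    cond = torch.zeros_like(noisy_latent)
    mask = torch.zeros(B, 1, T, H, W, device=noisy_latent.device)
    if cond_frames is not None:
        n = cond_frames.size(2)            # number of conditioned frames
        cond[:, :, :n] = cond_frames       # place VAE latents of the given frames
        mask[:, :, :n] = 1.0               # mark those frames as conditioned
    return torch.cat([noisy_latent, cond, mask], dim=1)   # (B, 2C+1, T, H, W)
```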

Waver further employs a dual text-encoder system: integrating flan-t5-xxl and Qwen2.5-32B-Instruct improves prompt understanding and enables robust prompt adherence across tasks.
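As an illustration, the two encoders named above can be queried roughly as follows with Hugging Face `transformers`. Only the choice of checkpoints is grounded in the text; how Waver fuses or projects the two feature streams is not specified here, so that step is left out.

```python
import torch
from transformers import AutoTokenizer, AutoModel, T5EncoderModel

def encode_prompt(prompt, device="cuda"):
    """Extract token features from both text encoders (a sketch; in practice the
    32B Qwen model would need multi-GPU or quantized loading)."""
    t5_tok = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
    t5 = T5EncoderModel.from_pretrained("google/flan-t5-xxl").to(device).eval()
    qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
    qwen = AutoModel.from_pretrained("Qwen/Qwen2.5-32B-Instruct").to(device).eval()

    with torch.no_grad():
        t5_inputs = t5_tok(prompt, return_tensors="pt").to(device)
        t5_feats = t5(**t5_inputs).last_hidden_state        # (1, N1, d_t5)
        qwen_inputs = qwen_tok(prompt, return_tensors="pt").to(device)
        qwen_feats = qwen(**qwen_inputs).last_hidden_state  # (1, N2, d_qwen)
    return t5_feats, qwen_feats                              # fusion/projection omitted
```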

3. Data Curation and Perceptual Quality Filtering

Robust high-fidelity generation demands exhaustive data curation, for which Waver employs a multi-stage pipeline:

  • Video Segmentation: Source videos are segmented into coherent clips using PySceneDetect together with DINOv2 feature-similarity checks.
  • Hierarchical Filtering: Clips are filtered for minimum standards (resolution, bitrate, frame rate) and aesthetic/artifact constraints (watermark elimination, flicker detection).
  • MLLM-Based Quality Annotation: Over one million video clips are manually annotated across 13 perceptual quality dimensions, forming a label set to fine-tune a VideoLLaMA3 multimodal LLM, which assesses and filters the final training set for optimal sample quality and balance.

This aggressive curation and annotation protocol is motivated by the need to reduce spurious motion artifacts and semantic drift, ensuring that training and inference operate on data distributions consistent with high-quality, diverse real-world videos.
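As a minimal illustration of the segmentation and minimum-standard filtering stages above (a sketch, not the paper's pipeline), PySceneDetect can split a source video into shots while coarse metadata checks reject low-quality sources. The thresholds are assumptions, and the DINOv2-similarity merging and MLLM-based quality scoring are omitted.

```python
import cv2
from scenedetect import detect, ContentDetector

def segment_and_filter(video_path, min_height=720, min_fps=24.0):
    """Split a source video into scene-coherent clips and apply basic resolution /
    frame-rate filters (cutoffs here are illustrative, not the paper's values)."""
    cap = cv2.VideoCapture(video_path)
    height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    if height < min_height or fps < min_fps:
        return []                                    # reject low-quality sources outright

    # PySceneDetect: content-aware shot boundary detection
    scenes = detect(video_path, ContentDetector(threshold=27.0))
    # Keep clips of a reasonable training duration (e.g. 2-15 s)
    return [(start, end) for start, end in scenes
            if 2.0 <= (end.get_seconds() - start.get_seconds()) <= 15.0]
```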

4. Empirical Performance and Leaderboard Results

Waver is evaluated through human-judged benchmarks focused on three principal axes—motion, image, and prompt alignment:

  • Motion Quality: Natural action rendering, interaction realism, and absence of visible distortion/artifacts.
  • Visual Quality: Sharpness, aesthetic coherence, color accuracy, and overall photorealism.
  • Prompt Following: Faithful rendering of prompted subjects and accurate realization of the described scene.

According to leaderboard data (Artificial Analysis, 2025-07-30), Waver ranks among the top three models for both T2V and I2V, outperforming all open-source solutions and matching or exceeding commercial offerings in motion fidelity, temporal consistency, and image clarity. The system achieves higher scores in both motion quality and prompt adherence compared to contemporary models in open competition.

5. Training and Inference Strategy

Waver introduces an explicit multi-stage training protocol:

  • Curriculum Optimization: Progressive joint training over increasing video/image resolutions (starting at 192p, reaching 720p native), sequentially mastering T2I, T2V, and I2V in a unified configuration. Early task mixing fosters stronger semantic and motion consistency at lower computational cost.
  • Cascade Refiner: After 720p synthesis, a cascade refinement module uses flow matching, windowed attention, and tailored noise scheduling to upscale outputs to 1080p while suppressing artifacts, preserving high-resolution fidelity.
  • Noise Scheduling: Distinct timestep-sampling schemes (logit-normal and mode-based) adjust the diffusion process for different generative tasks to balance convergence and generalization (see the sketch after this list).
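The noise-scheduling bullet refers to how training timesteps are drawn. A minimal sketch of logit-normal and mode-based timestep sampling is shown below; the formulations follow common rectified-flow practice, and the specific parameter values are assumptions rather than Waver's reported settings.

```python
import torch

def sample_timesteps_logit_normal(batch_size: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Draw timesteps in (0, 1) from a logit-normal distribution: a Gaussian sample
    squashed through a sigmoid, concentrating mass at intermediate noise levels."""
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)

def sample_timesteps_mode(batch_size: int, s: float = 1.29) -> torch.Tensor:
    """Mode-based sampling: warp uniform draws so density peaks near a chosen mode;
    `s` controls the peak sharpness (the value here is illustrative)."""
    u = torch.rand(batch_size)
    return 1.0 - u - s * (torch.cos(torch.pi / 2.0 * u) ** 2 - 1.0 + u)

# Example: per-example timesteps for a training batch of 8 videos
t = sample_timesteps_logit_normal(8)
```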

Inference follows the same staged pipeline: a low-resolution video is synthesized by the unified DiT, followed by cascade-based upscaling via flow matching.
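To make the staged inference concrete, the sketch below shows a generic Euler integration of a flow-matching velocity field for the base-resolution stage, followed by a hypothetical refiner call. The model signature, the t=1 to t=0 convention, and the helper names (`cascade_refiner.upscale`, `vae.decode`) are illustrative assumptions, not Waver's actual interface.

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, text_emb, cond_frames, mask, shape, steps=50, device="cuda"):
    """Euler integration of a learned flow-matching velocity field, from noise at t=1
    toward data at t=0 (signature and time convention are assumptions)."""
    x = torch.randn(shape, device=device)                     # start from pure Gaussian noise
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)   # integration grid
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), t.item(), device=device)
        v = model(x, t_batch, text_emb, cond_frames, mask)    # predicted velocity field
        x = x + (t_next - t) * v                              # Euler step toward the data manifold
    return x

# Hypothetical staged pipeline: 720p base generation, then cascade refinement to 1080p.
# latents_720p = flow_matching_sample(dit, text_emb, cond, mask, shape=(1, 16, 81, 90, 160))
# video_1080p  = cascade_refiner.upscale(vae.decode(latents_720p))
```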

6. Applications and Impact

As a unified high-fidelity video model supporting multimodal input and output, Waver’s capabilities facilitate a wide spectrum of use cases:

  • Content Creation: Automated generation for e-commerce displays, virtual try-on, product showcasing, and live streaming.
  • Digital Human & Virtual Production: Real-time digital human rendering and avatar video synthesis with precise motion and visual consistency.
  • Video Editing & Augmentation: Extended scenes or complete motion generation from partial inputs (either images or textual prompts), supporting complex editing workflows.
  • Research Foundation Model: Waver’s modular training and inference recipes, alongside an open-source release, enable the academic community to accelerate progress on unified video generation, benchmark reproducibility, and transfer learning for domain-specific tasks.

A plausible implication is that the convergence of unified architectures and rigorous quality pipelines, such as in Waver, will serve as blueprints for future foundation models tasked with democratizing high-fidelity, real-time digital video generation.


In summary, Waver’s hybrid transformer architecture, multimodal generation flexibility, curated dataset sourcing, optimized training/inference pipeline, and demonstrated empirical success on public leaderboards establish it as a top-tier approach for unified high-fidelity video generation (Zhang et al., 21 Aug 2025).
