Waver: Wave Your Way to Lifelike Video Generation (2508.15761v2)

Published 21 Aug 2025 in cs.CV

Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

Summary

  • The paper introduces Waver, a unified foundation model that leverages a Hybrid Stream DiT and Cascade Refiner to enable efficient high-resolution video generation.
  • It integrates a comprehensive data curation pipeline and advanced representation alignment to enhance motion synthesis and visual quality across text-to-image, text-to-video, and image-to-video tasks.
  • Empirical evaluations demonstrate that Waver ranks highly on public leaderboards with superior human evaluation metrics, particularly in complex motion scenarios.

Waver: Unified High-Fidelity Video Generation via Hybrid Stream DiT and Cascade Refinement

Introduction

The paper presents Waver, a unified foundation model for high-resolution image and video generation, supporting text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) tasks within a single framework. Waver leverages a Hybrid Stream DiT architecture and a Cascade Refiner to achieve efficient training and inference for 1080p video synthesis, with strong performance in both general and complex motion scenarios. The model is distinguished by its comprehensive data curation pipeline, advanced representation alignment, and infrastructure optimizations, resulting in top-tier rankings on public leaderboards and superior human evaluation metrics.

Model Architecture

Waver's architecture is composed of two principal modules: Task-Unified DiT and Cascade Refiner. The Task-Unified DiT is built on rectified flow Transformers and is designed to jointly model T2I, T2V, and I2V tasks via a flexible input conditioning mechanism. This mechanism concatenates noisy latents, conditional frame latents, and binary masks, enabling seamless task unification and extensibility to video interpolation and other conditional tasks. Figure 1

Figure 1: Architecture of Task-Unified DiT, illustrating the integration of dual and single stream blocks for modality alignment and efficiency.
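
For illustration, a minimal sketch of how such a concatenation-based conditioning input could be assembled; the latent shapes, channel layout, and variable names are assumptions for exposition, not the paper's exact implementation:

```python
import torch

def build_unified_input(noisy_latents, cond_latents, cond_mask):
    """Concatenate noisy latents, conditional frame latents, and a binary mask
    along the channel axis so a single DiT can serve T2I, T2V, and I2V.

    noisy_latents: (B, C, T, H, W) latents being denoised
    cond_latents:  (B, C, T, H, W) clean latents for given frames, zeros elsewhere
    cond_mask:     (B, 1, T, H, W) 1 where a frame is provided as condition, else 0
    """
    return torch.cat([noisy_latents, cond_latents, cond_mask], dim=1)

# Illustrative I2V case: condition on the first frame only.
B, C, T, H, W = 1, 16, 8, 45, 80            # hypothetical latent shape
noisy = torch.randn(B, C, T, H, W)
cond = torch.zeros_like(noisy)
mask = torch.zeros(B, 1, T, H, W)
cond[:, :, 0] = torch.randn(B, C, H, W)     # latent of the provided first frame
mask[:, :, 0] = 1.0                         # mark that frame as given
x = build_unified_input(noisy, cond, mask)  # (B, 2C+1, T, H, W) DiT input
```

Setting the mask to zero everywhere corresponds to unconditioned T2V generation, and marking arbitrary frames as given is what makes the same format extensible to interpolation and similar conditional tasks.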

The Hybrid Stream structure combines Dual Stream blocks (separate parameters for video and text, merged at self-attention) in the early layers for strong modality alignment, and Single Stream blocks (shared parameters) in later layers for parameter efficiency. Empirical results demonstrate that this hybridization converges faster than purely Dual Stream or purely Single Stream designs. Figure 2

Figure 2: Loss comparison between Hybrid Stream, Dual Stream, and Single Stream structures. Hybrid Stream's loss converges faster.
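
A schematic sketch of this hybrid arrangement is given below: a stack of Dual Stream blocks (separate video/text projections, joint self-attention) followed by Single Stream blocks operating on the merged sequence. Block internals, dimensions, and layer counts are placeholders rather than the released architecture:

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Separate parameters for video and text tokens, merged at self-attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, text):
        v, t = self.video_proj(video), self.text_proj(text)
        joint = torch.cat([v, t], dim=1)              # attend across both modalities
        out, _ = self.attn(joint, joint, joint)
        return out[:, :v.size(1)], out[:, v.size(1):]

class SingleStreamBlock(nn.Module):
    """Shared parameters for the already-aligned joint sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, joint):
        out, _ = self.attn(joint, joint, joint)
        return joint + self.mlp(out)

class HybridStreamSketch(nn.Module):
    def __init__(self, dim=512, dual_layers=4, single_layers=8):
        super().__init__()
        self.dual = nn.ModuleList(DualStreamBlock(dim) for _ in range(dual_layers))
        self.single = nn.ModuleList(SingleStreamBlock(dim) for _ in range(single_layers))

    def forward(self, video_tokens, text_tokens):
        for blk in self.dual:                         # early layers: modality alignment
            video_tokens, text_tokens = blk(video_tokens, text_tokens)
        joint = torch.cat([video_tokens, text_tokens], dim=1)
        for blk in self.single:                       # later layers: parameter efficiency
            joint = blk(joint)
        return joint[:, :video_tokens.size(1)]        # keep the video tokens
```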

For positional encoding, Waver employs a hybrid scheme: 3D RoPE for relative spatio-temporal positions and factorized learnable embeddings for absolute positions, enhancing extrapolation and convergence.
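
The two components can be sketched as follows; the per-axis split of the head dimension and the table sizes for the learnable embeddings are illustrative assumptions:

```python
import torch
import torch.nn as nn

def rope_1d(pos, dim, base=10000.0):
    """Rotary angles for integer positions `pos` over an even sub-dimension `dim`."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float()[:, None] * freqs[None, :]          # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def rotate_pairs(x):
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([-x2, x1], dim=-1).flatten(-2)

def apply_3d_rope(x, t_idx, h_idx, w_idx, dims=(32, 48, 48)):
    """Relative spatio-temporal encoding: rotate separate slices of the head
    dimension by temporal, height, and width positions (slices must sum to the
    head dimension, assumed 128 here)."""
    out, start = [], 0
    for idx, d in zip((t_idx, h_idx, w_idx), dims):
        cos, sin = rope_1d(idx, d)
        cos, sin = cos.repeat_interleave(2, -1), sin.repeat_interleave(2, -1)
        chunk = x[..., start:start + d]
        out.append(chunk * cos + rotate_pairs(chunk) * sin)
        start += d
    return torch.cat(out, dim=-1)

class FactorizedAbsolutePE(nn.Module):
    """Absolute positions: learnable embeddings factorized over t, h, w and summed."""
    def __init__(self, dim, max_t=64, max_h=90, max_w=160):
        super().__init__()
        self.t = nn.Embedding(max_t, dim)
        self.h = nn.Embedding(max_h, dim)
        self.w = nn.Embedding(max_w, dim)

    def forward(self, t_idx, h_idx, w_idx):
        return self.t(t_idx) + self.h(h_idx) + self.w(w_idx)
```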

The Cascade Refiner, also based on rectified flow Transformers, upscales 720p videos to 1080p using window attention and a two-stage degradation process (pixel and latent). This approach achieves a 40% acceleration over single-stage 1080p generation and corrects generative artifacts. Figure 3

Figure 3: Pipeline of Cascade Refiner, depicting the hierarchical upscaling and artifact correction process.

Figure 4

Figure 4: Refiner output: (a) upscaling and artifact correction; (b) video editing via latent manipulation.
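
At a high level, the cascade can be pictured as: generate at 720p, upsample to form the refiner's conditioning input, then denoise with attention restricted to local 3D windows. The helpers below are an illustrative sketch of that upsampling and windowing; the actual degradation recipe and window sizes are not given in this summary and are placeholders:

```python
import torch
import torch.nn.functional as F

def upsample_720p_to_1080p(video, size=(1080, 1920)):
    """Bilinearly upsample a (B, C, T, H, W) 720p video as the refiner's
    conditioning input, leaving the refiner to restore detail and fix artifacts."""
    B, C, T, H, W = video.shape
    frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
    up = F.interpolate(frames, size=size, mode="bilinear", align_corners=False)
    return up.reshape(B, T, C, *size).permute(0, 2, 1, 3, 4)

def window_partition(tokens, T, H, W, window=(4, 8, 8)):
    """Split a (B, T*H*W, D) token sequence into non-overlapping 3D windows so
    attention cost scales with the window size rather than the full sequence.
    Assumes T, H, W are divisible by the window sizes."""
    B, _, D = tokens.shape
    wt, wh, ww = window
    x = tokens.view(B, T, H, W, D)
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, D)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, D)
    return x  # run attention independently within each window
```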

Data Curation and Quality Modeling

Waver's data pipeline integrates multi-source acquisition, advanced segmentation, hierarchical filtering, and semantic balancing. Scene detection is refined using DINOv2 features, and motion magnitude is quantified via RAFT-based optical flow, with foreground segmentation to isolate subject dynamics. Figure 5

Figure 5: Data processing pipeline: acquisition, preprocessing, filtering, captioning, and balancing.
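
A hedged sketch of such a motion score: given a dense optical-flow field (e.g., produced by RAFT) and a foreground mask from segmentation, average the flow magnitude over the subject rather than the whole frame so that camera shake and static backgrounds do not dominate. The thresholds below are placeholders, not the pipeline's actual values:

```python
import torch

def foreground_motion_score(flow, fg_mask):
    """flow:    (T-1, 2, H, W) per-pixel displacement between consecutive frames
    fg_mask: (T-1, 1, H, W) binary foreground (subject) mask
    Returns the mean flow magnitude over foreground pixels."""
    mag = torch.linalg.vector_norm(flow, dim=1, keepdim=True)   # (T-1, 1, H, W)
    fg = fg_mask.float()
    return (mag * fg).sum() / fg.sum().clamp(min=1.0)

def keep_clip(flow, fg_mask, min_score=1.5, max_score=40.0):
    """Drop near-static clips and clips with implausibly large (likely noisy) motion."""
    score = foreground_motion_score(flow, fg_mask).item()
    return min_score <= score <= max_score
```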

An MLLM-based video quality model, trained on manual annotations covering 13 dimensions of low quality, further filters over 1M clips and achieves 78% accuracy in predicting high-quality samples. Figure 6

Figure 6: Distribution of video quality issues in annotated training clips.

Hierarchical filtering stages progressively increase quality criteria, culminating in a dataset optimized for each training phase. Figure 7

Figure 7: Hierarchical data filtering funnel for progressive dataset refinement.

Synthetic data generation and manual review are employed to address aesthetic gaps, especially in high-motion scenes. Figure 8

Figure 8: Aesthetic comparison: synthetic data exhibits superior lighting and composition over real-world video.

Figure 9

Figure 9: Manual review statistics for synthetic video samples; distortion is the primary failure reason.

Training and Optimization Strategies

Waver's multi-stage training schedule covers T2I, T2V, I2V, and refinement tasks, with progressive resolution scaling and joint training to enhance motion and instruction following. Representation alignment is enforced via cosine similarity between Qwen2.5-VL semantic features and DiT intermediate features, yielding improved semantic quality. Figure 10

Figure 10: Qualitative comparison: representation alignment produces more organized and meaningful video semantics.
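
A minimal sketch of such an alignment term, assuming frozen semantic features (e.g., from an MLLM) already resampled to the same token grid as an intermediate DiT feature map; the projection head and the loss weighting are assumptions, and the semantic-feature extraction itself is abstracted away:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Projects intermediate DiT features into the semantic feature space."""
    def __init__(self, dit_dim, sem_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dit_dim, sem_dim), nn.SiLU(),
                                  nn.Linear(sem_dim, sem_dim))

    def forward(self, feats):
        return self.proj(feats)

def alignment_loss(dit_feats, sem_feats, head):
    """dit_feats: (B, N, dit_dim) features from an intermediate DiT block
    sem_feats: (B, N, sem_dim) frozen semantic features for the clean video."""
    cos = F.cosine_similarity(head(dit_feats), sem_feats.detach(), dim=-1)  # (B, N)
    return 1.0 - cos.mean()   # push the two representations together

# total_loss = diffusion_loss + lambda_align * alignment_loss(...)  # lambda_align is a tuning knob
```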

Motion optimization is achieved through low-resolution pretraining, mode-based timestep sampling, and joint T2V/I2V training. Mode sampling yields larger motion amplitudes than logit-normal, as evidenced in ablation studies. Figure 11

Figure 11: 480p T2V results: 192p pretraining enables larger motion.

Figure 12

Figure 12: 720p T2V: mode sampling produces more intense motion than logit-normal.

Figure 13

Figure 13: Probability density functions for timestep sampling: mode distribution is sharply peaked for T2V/I2V.
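
For reference, the two sampler families being compared can be sketched as below. The exact densities and scale parameters Waver uses are not given in this summary, so the formulas follow common rectified-flow formulations and should be read as assumptions:

```python
import torch

def logit_normal_timesteps(n, mean=0.0, std=1.0):
    """t = sigmoid(u), u ~ N(mean, std): mass concentrated around sigmoid(mean)."""
    return torch.sigmoid(torch.randn(n) * std + mean)

def mode_timesteps(n, scale=1.29):
    """Mode-style sampling as commonly formulated in the rectified-flow
    literature (the scale value is a placeholder): for u ~ U(0, 1),
    t = 1 - u - scale * (cos(pi/2 * u)**2 - 1 + u)."""
    u = torch.rand(n)
    return 1.0 - u - scale * (torch.cos(torch.pi / 2.0 * u) ** 2 - 1.0 + u)

# Inside a training step one would sample t per clip and form the usual
# rectified-flow interpolation, e.g. x_t = (1 - t) * x0 + t * noise (with t broadcast).
```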

I2V motion is further improved by introducing initial frame conditioning with a 20% probability during joint training. Figure 14

Figure 14: 720p I2V: joint T2V/I2V training yields larger motion amplitudes.
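
The joint scheme can be sketched as randomly deciding, per sample, whether to supply the clean first frame as a condition; the 20% probability is from the summary above, while the batch handling and names are illustrative:

```python
import torch

def sample_frame_conditioning(latents, i2v_prob=0.2):
    """With probability i2v_prob, condition on the clean first frame (I2V);
    otherwise supply no frame condition (pure T2V). latents: (B, C, T, H, W)."""
    B, C, T, H, W = latents.shape
    cond = torch.zeros_like(latents)
    mask = torch.zeros(B, 1, T, H, W, device=latents.device)
    use_i2v = torch.rand(B, device=latents.device) < i2v_prob
    cond[use_i2v, :, 0] = latents[use_i2v, :, 0]   # hand over the first frame
    mask[use_i2v, :, 0] = 1.0
    return cond, mask   # concatenated with the noisy latents as in the unified input
```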

Foreground motion scoring and quality model filtering remove static and slow-motion clips, ensuring training data supports robust motion synthesis. Figure 15

Figure 15: Foreground optical flow: high vs. low motion scores in training clips.

Aesthetics optimization leverages curated synthetic data and balanced data diets, with high-aesthetic finetuning yielding a 7% improvement in visual quality without degrading motion. Figure 16

Figure 16: Visual quality before and after high-aesthetic finetuning.

Model balancing is achieved via prompt tagging, video APG (a decomposition and normalization of the classifier-free guidance update), and model averaging, with empirical improvements in realism and artifact reduction. Figure 17

Figure 17: Six distinct video styles generated via prompt tagging.

Figure 18

Figure 18: APG hyperparameters yield greater realism and fewer artifacts than standard CFG.
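
The APG idea can be sketched roughly as follows, based on the published adaptive projected guidance formulation rather than Waver's exact variant: split the classifier-free guidance update into components parallel and orthogonal to the conditional prediction, down-weight the parallel part, and clip the update norm. All hyperparameters are placeholders:

```python
import torch

def apg_guidance(cond, uncond, scale=7.5, eta=0.0, norm_threshold=10.0):
    """cond, uncond: (B, C, T, H, W) conditional / unconditional predictions."""
    diff = cond - uncond
    # Clip the update norm to avoid oversaturation at high guidance scales.
    bshape = (-1,) + (1,) * (diff.ndim - 1)
    norm = diff.flatten(1).norm(dim=1).view(bshape)
    diff = diff * torch.clamp(norm_threshold / (norm + 1e-8), max=1.0)
    # Project onto the conditional prediction and suppress the parallel part.
    unit = cond / (cond.flatten(1).norm(dim=1).view(bshape) + 1e-8)
    dims = tuple(range(1, diff.ndim))
    parallel = (diff * unit).sum(dim=dims, keepdim=True) * unit
    orthogonal = diff - parallel
    return cond + (scale - 1.0) * (orthogonal + eta * parallel)
```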

Figure 19

Figure 19: Model averaging improves motion, visual quality, and prompt following.
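
Checkpoint averaging itself is straightforward; which checkpoints are averaged and with what weights is not specified here, so treat the following as a generic sketch:

```python
import torch

def average_checkpoints(paths, weights=None):
    """Weighted average of state dicts from checkpoints with identical
    architectures (floating-point parameters assumed). Paths are placeholders."""
    states = [torch.load(p, map_location="cpu") for p in paths]
    weights = weights or [1.0 / len(states)] * len(states)
    return {k: sum(w * s[k].float() for w, s in zip(weights, states)) for k in states[0]}

# merged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(merged)
```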

Prompt rewriting aligns user inputs with training captions, enhancing generation consistency and richness. Figure 20

Figure 20: Prompt rewriting improves visual richness and aesthetic quality.

Infrastructure and Scaling

Waver employs hybrid sharded FSDP, torch.compile, Ulysses sequence parallelism, bucket dataloaders, selective activation checkpointing, and activation offloading to enable efficient training of high-resolution video models with extremely long sequences. Figure 21

Figure 21: Infrastructure optimizations: sequence parallelism, checkpointing, and offloading.
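
As a rough illustration of two of these pieces, the snippet below wraps a model with hybrid-sharded FSDP and torch.compile using standard PyTorch APIs; it requires a distributed launch (e.g., torchrun), and Ulysses sequence parallelism, bucket dataloaders, selective checkpointing, and offloading are omitted. None of the settings shown are Waver's actual configuration:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

def wrap_for_training(model: torch.nn.Module) -> torch.nn.Module:
    dist.init_process_group("nccl")                       # assumes torchrun environment variables
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard within a node, replicate across nodes
    )
    return torch.compile(model)                           # graph capture / kernel fusion
```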

MFU (Model FLOPs Utilization) increases from 0.32 at 192p to 0.40 at 1080p, indicating effective hardware utilization.
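
MFU can be estimated with the usual 6·N·D approximation for transformer training FLOPs (forward plus backward); the numbers below are invented purely to show the arithmetic and are not Waver's measurements:

```python
def model_flops_utilization(params, tokens_per_step, step_time_s, peak_tflops):
    """Achieved FLOPs/s (≈ 6 * params * tokens / step time, ignoring attention
    terms) divided by the hardware's peak FLOPs/s."""
    achieved = 6.0 * params * tokens_per_step / step_time_s
    return achieved / (peak_tflops * 1e12)

# Hypothetical example: 10B parameters, 350k tokens per step, 60 s per step,
# on hardware with a 989 TFLOPS bf16 peak -> roughly 0.35.
print(round(model_flops_utilization(10e9, 3.5e5, 60.0, 989.0), 3))
```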

Benchmarking and Evaluation

Waver ranks third on both T2V and I2V tracks of the Artificial Analysis Arena, matching or surpassing state-of-the-art commercial solutions. Figure 22

Figure 22: Official T2V and I2V leaderboard rankings: Waver is third in both.

Human evaluation on Waver-bench 1.0 and Hermes Motion Testset demonstrates superior motion quality, visual quality, and prompt following compared to Veo3, Kling2.0, and Wan2.1, with pronounced advantages in complex motion scenarios. Figure 23

Figure 23: Human evaluation win rates: Waver outperforms competitors, especially in complex motion scenarios.

Empirical Analysis and Future Directions

Sparse attention analysis reveals heterogeneous, layer-wise, and timestep-wise sparsity patterns, motivating adaptive sparse attention mechanisms over fixed-window approaches. NSA-based methods and spatial-temporal sliding windows are promising for future efficiency gains.

VAE optimization is critical: KL loss weight must be balanced to avoid grainy textures and background distortion, while LPIPS loss should be minimized to prevent grid-like artifacts. Future VAEs should target higher compression ratios and multimodal fusion.
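
In loss terms, this amounts to tuning the weights in a standard VAE objective; the sketch below only names the knobs being discussed, with placeholder defaults rather than recommended values:

```python
def vae_training_loss(recon_loss, kl_loss, lpips_loss,
                      kl_weight=1e-6, lpips_weight=0.1):
    """Combine VAE loss terms. The analysis above cautions that the KL weight
    and the perceptual (LPIPS) contribution both need careful balancing to
    avoid grainy textures, background distortion, and grid-like artifacts."""
    return recon_loss + kl_weight * kl_loss + lpips_weight * lpips_loss
```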

Detailed captioning and explicit spatial relation QA pairs are shown to improve instruction following and semantic alignment in T2V models.

Conclusion

Waver establishes a unified, scalable framework for high-fidelity video generation, integrating architectural innovations, data curation, and infrastructure optimizations. The model achieves strong empirical results in both general and complex motion scenarios, with robust motion, visual quality, and prompt adherence. Limitations remain in high-motion detail preservation and expressiveness, suggesting future work in RL-based distortion mitigation and multimodal VAE development. The practical recipes and insights provided are directly applicable to advancing video generation research and deployment.

Explain it Like I'm 14

Overview

This paper introduces Waver, a powerful AI system that can create realistic videos and images from simple text descriptions (“a cat jumping on a couch”) or from a single starting photo. Waver focuses on making videos that look real, move naturally, and follow instructions closely—especially in tough, fast-moving scenes like sports. It combines several tasks (text-to-video, image-to-video, and text-to-image) in one model and uses a second “refiner” step to boost video clarity to full HD (1080p).

What questions does the paper try to answer?

The researchers set out to solve a few big problems:

  • How can we make AI-generated videos look more real and beautiful, especially at high resolutions like 1080p?
  • How can we handle complicated motion (like basketball or gymnastics) so actions don’t look stiff or fake?
  • How can we train one model that does text-to-video, image-to-video, and text-to-image together, instead of using separate models for each?
  • How can we share enough training details so other teams can reproduce the results and build on them?

How does Waver work? (Methods explained simply)

Think of Waver’s process like making a great movie:

  1. First you plan the scenes (understand text instructions and the starting image).
  2. Then you film a rough cut (a lower-resolution video).
  3. Finally you do a professional edit and upscale to full HD (clean up details, fix artifacts, and sharpen everything).

Here are the main parts:

Unified model for three tasks

  • Text-to-Image (T2I): Type a description, get a picture.
  • Text-to-Video (T2V): Type a description, get a short video (5–10 seconds).
  • Image-to-Video (I2V): Give a single photo, and the model creates the next moments as a video.

Waver handles all three in one system by using a flexible input format that tells the model which frames are “given” (conditions) and which it should create. It’s like giving the model a storyboard: some frames are already decided, others are blank for it to fill in.

Hybrid Stream DiT (the “brain” of the generator)

  • “DiT” is a type of Transformer (a smart pattern-finding tool used in modern AI).
  • Waver mixes two styles of processing:
    • Dual Stream: Treats text and video features separately at first, so the model aligns what the words mean with what the video should do. This improves understanding and instruction-following.
    • Single Stream: Later, it merges them to be more efficient and faster to train.
  • This hybrid approach converges faster (it learns quicker) than using only one style.

Knowing where and when things happen

  • The model uses position signals so it understands time (frame order) and space (where things are on screen). Think of this like a map and a timeline, helping the model keep objects in the right place and moving at the right speed.

Cascade Refiner (the “professional editor”)

  • Generating full HD (1080p) directly is slow and expensive.
  • Waver first creates a strong 720p video, then a second model—the Refiner—upscales it to 1080p and fixes artifacts.
  • It uses “window attention” (looking at small chunks at a time) to save compute while keeping detail.
  • The Refiner also learns to correct typical AI “glitches,” not just blur. Sometimes, with stronger settings, it can even edit content (like changing a person’s appearance).

High-quality training data pipeline

  • Collects a huge amount of video (over 200 million clips) from many sources.
  • Automatically cuts longer videos into meaningful clips by detecting scene changes and motion.
  • Scores clips for:
    • Technical quality (resolution, frame rate).
    • Visual aesthetics (beauty, composition).
    • Motion quality (how objects move vs. camera shake), using optical flow (a way to measure movement between frames).
  • Trains an AI “quality judge” to filter out low-quality or unsafe samples more reliably.
  • Trains a “caption model” to write detailed descriptions, especially of actions and their timing (start/end moments), so the generator learns motion sequences better.
  • Balances the dataset so rare categories (like certain sports) are well represented.

Training recipe and motion tricks

  • Multi-stage training: start small (192p) to learn motion well, then step up to 480p and 720p, and finally refine to 1080p.
  • Joint training of T2V and I2V: Mixing them prevents I2V from getting “stuck” on the first frame and encourages real movement forward in time.
  • Smart “noise scheduling”: This is like choosing which training steps to practice more. They found a schedule that leads to bigger, smoother motion.
  • Representation alignment: Another AI (that’s good at understanding video content) helps the generator stay semantically on track, improving how well the video matches the meaning of the prompt.
  • Aesthetic boost with synthetic data: They make beautiful, creative video samples from high-quality images and use them to finetune the model—carefully and in balance—so it looks great without losing realism or motion.

What did they find? (Main results)

Here are the high-level takeaways:

  • Waver makes 5–10 second videos at 720p, then cleanly upscales to 1080p—with a reported 40% speed-up compared to trying to do 1080p in one shot.
  • It captures complex motion much better than many competitors, with stronger motion amplitude (bigger, more confident movement) and good temporal consistency (actions flow naturally).
  • On public leaderboards, Waver ranks in the Top 3 for both text-to-video and image-to-video, beating most open-source models and matching or exceeding several commercial ones.
  • In tough motion tests (like sports), Waver shows an even bigger advantage, meaning it handles fast, complex action particularly well.
  • Detailed training and data recipes help others reproduce or improve the system.

Why is this important? (Impact and implications)

  • Better creative tools: Filmmakers, educators, and creators can generate high-quality videos from ideas or single images, speeding up content production.
  • Stronger motion realism: Handling sports and quick actions brings AI video closer to real-world use—from product demos to training videos.
  • Unified design saves resources: One model for text-to-image, text-to-video, and image-to-video is more efficient than keeping separate models.
  • Practical guidance for the community: The paper shares how to curate data, train models, and balance quality, motion, and realism. This makes it easier for others to build powerful video generators.
  • Future applications: With robust motion and high-quality visuals, these systems can support virtual try-on, e-commerce showcases, digital avatars, and new forms of storytelling.

In short, Waver shows how to combine smart architecture, careful data preparation, and thoughtful training strategies to make AI-generated videos more lifelike, especially when things move fast.
