Systematic Exploration of Alternative Semantic Encoders for SemanticGen

Systematically evaluate the effect of substituting the Qwen-2.5-VL vision tower with other video semantic tokenizers (such as V-JEPA 2, VideoMAE 2, and 4DS) as the semantic encoder in the SemanticGen two-stage video generation framework. The analysis should cover how each encoder shapes the learned semantic representation distribution and how that distribution, in turn, affects the downstream video generation process.
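
To make the substitution concrete, a minimal sketch of a pluggable encoder interface is shown below. This is not the authors' code: the class names, the (B, T, C, H, W) to (B, N, D) tensor contract, and the stub patch-projection encoder are illustrative assumptions; real adapters would wrap the pretrained Qwen-2.5-VL vision tower, V-JEPA 2, VideoMAE 2, or 4DS backbones behind the same `encode` signature.

```python
# Hypothetical pluggable-encoder interface for a SemanticGen-style study.
# All names and shapes here are assumptions, not the paper's implementation.
from abc import ABC, abstractmethod

import torch
import torch.nn as nn


class SemanticEncoder(nn.Module, ABC):
    """Contract: map pixel video (B, T, C, H, W) -> semantic tokens (B, N, D)."""

    @abstractmethod
    def encode(self, video: torch.Tensor) -> torch.Tensor:
        ...


class StubEncoder(SemanticEncoder):
    """Placeholder standing in for e.g. a Qwen-2.5-VL vision tower or a V-JEPA 2
    backbone; a frozen random projection just fixes the tensor contract."""

    def __init__(self, patch: int = 16, dim: int = 1024) -> None:
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)

    @torch.no_grad()
    def encode(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape
        p = self.patch
        # Cut each frame into non-overlapping patches, then project to D dims.
        x = video.unfold(3, p, p).unfold(4, p, p)       # (B,T,C,H/p,W/p,p,p)
        x = x.permute(0, 1, 3, 4, 2, 5, 6).reshape(b, -1, c * p * p)
        return self.proj(x)                             # (B, N, D)


if __name__ == "__main__":
    enc: SemanticEncoder = StubEncoder()
    tokens = enc.encode(torch.randn(2, 8, 3, 64, 64))
    print(tokens.shape)  # torch.Size([2, 128, 1024])
```

With such an interface in place, the rest of the training pipeline can treat the encoder as a swappable module, which is what a controlled encoder comparison requires.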

Background

SemanticGen generates videos by first generating in a compact, high-level semantic space and then mapping those semantic representations to VAE latents. The method relies on a semantic encoder to extract video-level representations; in the paper, the authors instantiate this component with the vision tower of Qwen-2.5-VL.
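
The two-stage flow can be summarized as in the sketch below. This is a hypothetical scaffold, not the SemanticGen architecture: module names, shapes, and the single-linear-layer stand-ins are placeholders that only fix the data flow from condition to semantic tokens to VAE latents.

```python
# Illustrative two-stage data flow: condition -> semantic tokens -> VAE latents.
# Stand-in modules only; the real generator/decoder architectures are not shown.
import torch
import torch.nn as nn


class SemanticSpaceGenerator(nn.Module):
    """Stage 1 stand-in: produce semantic tokens (B, N, D) from a condition.
    In the paper this is a generative model over the compact semantic space."""

    def __init__(self, cond_dim: int = 512, n_tokens: int = 128, dim: int = 1024):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.net = nn.Linear(cond_dim, n_tokens * dim)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.net(cond).view(-1, self.n_tokens, self.dim)


class SemanticToLatentDecoder(nn.Module):
    """Stage 2 stand-in: map semantic tokens to VAE latents (B, T, C, H, W)."""

    def __init__(self, dim: int = 1024, t: int = 8, c: int = 4, hw: int = 16):
        super().__init__()
        self.t, self.c, self.hw = t, c, hw
        self.net = nn.Linear(dim, t * c * hw * hw)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        pooled = tokens.mean(dim=1)             # (B, D) global summary
        return self.net(pooled).view(-1, self.t, self.c, self.hw, self.hw)


if __name__ == "__main__":
    cond = torch.randn(2, 512)                  # e.g. a text embedding
    sem = SemanticSpaceGenerator()(cond)        # (2, 128, 1024) semantic tokens
    latents = SemanticToLatentDecoder()(sem)    # (2, 8, 4, 16, 16) VAE latents
    print(sem.shape, latents.shape)             # a VAE decoder yields pixels
```

The semantic encoder's role in training is to define the target space for stage 1, which is why changing the encoder changes both what stage 1 must model and what stage 2 receives.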

The authors note that other video semantic tokenizers (e.g., V-JEPA 2, VideoMAE 2, 4DS) are compatible with the framework and highlight the importance of understanding how the choice of semantic encoder influences training efficiency and final video quality. They explicitly defer a systematic analysis of these alternatives to future work.
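
A systematic comparison would likely start with distributional diagnostics of each encoder's token space. The sketch below shows one plausible set of statistics (entropy-based effective rank, mean token norm, per-dimension variance); it uses random tensors as stand-ins, since in a real study the token matrices would come from running each pretrained encoder on the same video clips.

```python
# Hedged sketch of per-encoder representation diagnostics. Encoder names are
# placeholders; random tensors stand in for real (N, D) token matrices.
import torch


def effective_rank(tokens: torch.Tensor) -> float:
    """Entropy-based effective rank of the (N, D) token matrix's spectrum."""
    x = tokens - tokens.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(x)
    p = s / s.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    return float(entropy.exp())


def summarize(name: str, tokens: torch.Tensor) -> None:
    print(f"{name:>12}: eff_rank={effective_rank(tokens):8.1f} "
          f"mean_norm={tokens.norm(dim=-1).mean().item():8.2f} "
          f"dim_var={tokens.var(dim=0).mean().item():8.4f}")


if __name__ == "__main__":
    # In practice, extract tokens from each candidate encoder on shared clips.
    for name in ["qwen2.5-vl", "v-jepa-2", "videomae-2", "4ds"]:
        summarize(name, torch.randn(4096, 1024))
```

Statistics like these would help separate effects of the representation distribution itself (e.g., how concentrated or isotropic the token space is) from effects of downstream generator capacity.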

References

"In this paper, we use Qwen-2.5-VL to validate the effectiveness of the proposed method, and we leave the systematic exploration of other semantic encoders as future work."

Bai et al., "SemanticGen: Video Generation in Semantic Space," arXiv:2512.20619, 23 Dec 2025. Section 3.2, Video Generation with Semantic Embeddings (subsection: What Kinds of Semantic Encoders Are Needed for Video Generation?).