Systematic Exploration of Alternative Semantic Encoders for SemanticGen
Systematically evaluate how replacing the Qwen-2.5-VL vision tower with other video semantic tokenizers, such as V-JEPA 2, VideoMAE 2, and 4DS, affects the SemanticGen two-stage video generation framework when each is used as its semantic encoder. The analysis should cover how each encoder changes the distribution of the learned semantic representations and how it changes the downstream video generation process.
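One part of this question, how an encoder shapes the distribution of semantic representations, could be probed with simple summary statistics before any generator is trained. The sketch below is illustrative only and is not taken from the SemanticGen paper. The function names `participation_ratio` and `summarize` are hypothetical, the choice of the participation ratio as an effective-dimensionality measure is an assumption, and the two arrays are synthetic stand-ins rather than outputs of any real encoder.

```python
import numpy as np


def participation_ratio(embeddings: np.ndarray) -> float:
    """Effective dimensionality of embeddings with shape (num_tokens, dim).

    Computed as (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues).
    The value lies between 1 and dim; higher means variance is spread over
    more directions.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to the
    # eigenvalues of the sample covariance matrix.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values**2 / max(len(embeddings) - 1, 1)
    return float(eigenvalues.sum() ** 2 / (eigenvalues**2).sum())


def summarize(embeddings: np.ndarray) -> dict:
    """Collect a few distribution statistics for one (hypothetical) encoder."""
    return {
        "dim": embeddings.shape[1],
        "mean_token_norm": float(np.linalg.norm(embeddings, axis=1).mean()),
        "participation_ratio": participation_ratio(embeddings),
    }


# Synthetic stand-ins (assumption): in a real study these would be token
# embeddings extracted from the same videos by each candidate encoder.
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 64))                           # variance in all 64 dims
low_rank = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 64))  # variance in 4 dims only

for name, emb in {"isotropic": isotropic, "low_rank": low_rank}.items():
    print(name, summarize(emb))
```

Statistics like these only describe the representation space. Whether they predict generation quality would still have to be checked by training the two-stage pipeline with each encoder.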
References
In this paper, we use Qwen-2.5-VL to validate the effectiveness of the proposed method, and we leave the systematic exploration of other semantic encoders as future work.
— SemanticGen: Video Generation in Semantic Space
(2512.20619 - Bai et al., 23 Dec 2025) in Section 3.2, Video Generation with Semantic Embeddings (subsection: What Kinds of Semantic Encoders Are Needed for Video Generation?)