SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
This presentation explores SweetTok, a novel video tokenizer that achieves strong compression efficiency while maintaining high reconstruction quality. By decoupling the spatial and temporal dimensions and drawing semantic information from language model embeddings, SweetTok uses only 25% of the tokens required by existing methods while delivering superior video generation results. The talk covers the Cross-attention Query AutoEncoder architecture, the language-based codebook strategy, and experimental results demonstrating a 32.9% improvement in generation metrics.

Script
Imagine compressing a video down to just one-quarter of its normal size while actually improving its quality. This seems impossible, yet it's exactly what the researchers behind SweetTok have achieved by teaching their tokenizer to understand semantic meaning.
Building on that intriguing possibility, let's examine why video compression has been so difficult.
Current video tokenizers face a fundamental challenge: they treat spatial and temporal information inefficiently, creating redundant representations that demand excessive token counts for reconstruction.
This limitation sparked an innovative solution that rethinks how we represent video data.
The architecture introduces two key components: a Cross-attention Query AutoEncoder, whose learnable queries separate spatial appearance from temporal motion, and dual codebooks derived from language model embeddings that capture semantic meaning through carefully chosen word categories.
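As a rough illustration (not the authors' implementation), the query-based decoupling can be sketched in NumPy: a small set of spatial queries cross-attends to one frame's patch features to capture appearance, while a separate set of temporal queries attends across frames to capture motion. The query counts, feature dimensions, and pooling choices below are illustrative assumptions.

```python
import numpy as np

def cross_attention(queries, context):
    """Single-head cross-attention: queries (Q, d) attend to context (N, d)."""
    scores = queries @ context.T / np.sqrt(queries.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ context  # (Q, d)

def tokenize_video(video, n_spatial=16, n_temporal=4, seed=0):
    """Compress (T, H, W, d) patch features into a few spatial + temporal tokens."""
    T, H, W, d = video.shape
    rng = np.random.default_rng(seed)
    spatial_q = rng.standard_normal((n_spatial, d))   # learnable in practice
    temporal_q = rng.standard_normal((n_temporal, d))
    # Spatial queries summarize appearance from a single frame's patches.
    spatial_tokens = cross_attention(spatial_q, video[0].reshape(-1, d))
    # Temporal queries summarize motion from per-frame pooled features.
    per_frame = video.reshape(T, -1, d).mean(axis=1)  # (T, d)
    temporal_tokens = cross_attention(temporal_q, per_frame)
    return spatial_tokens, temporal_tokens

video = np.random.default_rng(1).standard_normal((8, 4, 4, 32))
s, t = tokenize_video(video)
print(s.shape, t.shape)
```

The point is the token budget: 16 + 4 = 20 query tokens stand in for 8 × 4 × 4 = 128 patch tokens, mirroring how decoupled queries let SweetTok represent a video with a fraction of the tokens.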
To train this complex system, the researchers implemented a three-stage curriculum that gradually builds capability, starting with spatial understanding and progressively incorporating temporal dynamics.
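The staged schedule can be sketched as a driver that unfreezes parameter groups stage by stage. The group names and the joint third stage are assumptions for illustration; the talk only states that training begins with spatial understanding and progressively adds temporal dynamics.

```python
# Hypothetical three-stage curriculum: group names and stage order are assumptions.
STAGES = [
    {"name": "1: spatial",  "trainable": {"spatial_encoder", "spatial_codebook"}},
    {"name": "2: temporal", "trainable": {"temporal_encoder", "temporal_codebook"}},
    {"name": "3: joint",    "trainable": {"spatial_encoder", "spatial_codebook",
                                          "temporal_encoder", "temporal_codebook"}},
]

ALL_GROUPS = {"spatial_encoder", "spatial_codebook",
              "temporal_encoder", "temporal_codebook"}

def run_curriculum(train_step):
    """Freeze everything except the current stage's groups, then train that stage."""
    log = []
    for stage in STAGES:
        frozen = ALL_GROUPS - stage["trainable"]
        train_step(stage["name"], stage["trainable"], frozen)
        log.append((stage["name"], sorted(frozen)))
    return log

log = run_curriculum(lambda name, trainable, frozen: None)
print(log)
```

Each stage only updates the modules it targets, so spatial representations are stable before temporal dynamics are introduced, and everything is reconciled at the end.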
Now let's see how this elegant architecture performs in practice.
The experimental results are striking: SweetTok achieves the same reconstruction quality as existing methods while using 75% fewer tokens, and when generating new videos, it outperforms competitors by nearly 33%.
Beyond video, the tokenizer demonstrates strong performance on image data, maintaining semantic-rich latent spaces that support diverse visual understanding applications.
SweetTok proves that understanding meaning, not just pixels, is the key to efficient video representation—a principle that bridges computer vision and natural language processing in powerful new ways.
This semantic-aware approach to tokenization marks an important step toward truly intelligent visual compression. Visit EmergentMind.com to explore more cutting-edge research at the intersection of vision and language.