Convolutional Tokenizer in Vision Models

Updated 1 March 2026
  • Convolutional tokenizers are neural modules that use convolutional layers to transform complex image and video data into compact, semantically-rich tokens.
  • They employ encoder-decoder architectures with decoupled spatial and temporal processing, enhanced by techniques like FSQ and binary sign quantization for efficient discretization.
  • These tokenizers drive advancements in video generation, image compression, and multimodal modeling by boosting computational efficiency, training stability, and fidelity.

A convolutional tokenizer is a neural module that transforms high-dimensional image or video data into a compact set of tokens using convolutional architectures, optionally augmented by attention mechanisms, quantization, and curriculum training. Convolutional tokenizers provide efficient and information-preserving representations for diverse generative and discriminative tasks in computer vision and multimodal learning, supporting both continuous and discrete latent spaces. Their significance is established in state-of-the-art systems for video generation (e.g., VidTok (Tang et al., 2024)) and large-scale unified multimodal modeling (e.g., UniWeTok (Zhuang et al., 15 Feb 2026)), where they offer improved computational efficiency, training stability, semantic richness, and fidelity compared to purely transformer-based approaches.

1. Model Architectures and Principles

Convolutional tokenizers employ encoder–decoder frameworks that leverage the locality, weight sharing, and translation equivariance of convolutional layers to extract spatial and/or spatio-temporal features from high-dimensional inputs.

In VidTok, the convolutional tokenizer operates on a video $X \in \mathbb{R}^{N \times 3 \times H \times W}$, generating compact latents $Z = f_{\text{enc}}(X) \in \mathbb{R}^{n \times c \times h \times w}$ and reconstructing the sequence via $f_{\text{dec}}(Z)$. The design strategically decouples spatial and temporal processing:

  • Spatial Down/Upsampling: 2D convolutions (kernel $3 \times 3$) with stride $r_s$ or transposed convolutions, stacked with LayerNorm and non-linearity.
  • Temporal Down/Upsampling: Causal 1D convolutions in the time direction, combined with an "AlphaBlender" interpolation operator, $x \leftarrow \alpha x_1 + (1-\alpha) x_2$ with $\alpha = \sigma(0.2)$.
  • 3D Convolutions at Bottlenecks: Used only at critical fusion and I/O stages for spatio-temporal context aggregation.
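The temporal path described above can be illustrated with a minimal pure-Python sketch. It is not VidTok's implementation: the frames are reduced to hypothetical scalar features, the kernel weights are made up, left-padding by repeating the first frame is one assumed way to enforce causality, σ is assumed to be the logistic sigmoid, and the choice of which branch plays $x_1$ versus $x_2$ is also an assumption.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def causal_conv1d(frames, kernel):
    """1D convolution along time where output[t] only reads frames[0..t]."""
    k = len(kernel)
    # Assumed padding scheme: repeat the first frame on the left so that
    # no future frame is ever read.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


def alpha_blend(x1, x2, alpha):
    """AlphaBlender-style interpolation: alpha * x1 + (1 - alpha) * x2."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(x1, x2)]


ALPHA = sigmoid(0.2)  # about 0.55, following alpha = sigma(0.2) in the text

frames = [0.0, 1.0, 2.0, 3.0]  # hypothetical per-frame scalar features
kernel = [0.25, 0.25, 0.5]     # hypothetical weights; last tap is the current frame
conv_out = causal_conv1d(frames, kernel)
blended = alpha_blend(conv_out, frames, ALPHA)
```

Because the padding sits entirely on the left, changing a later frame never alters earlier outputs, which is the property that makes the convolution causal.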

VidTok’s architectural ablations reveal that (a) a fully 3D-convolutional backbone incurs much higher FLOPs (~16.98T) compared to (b) decoupled 2D+1D sampling (~7.17T); (c) adding the AlphaBlender recovers fidelity (PSNR rises from 29.36 to 29.64) with moderate additional cost (10.35T FLOPs) (Tang et al., 2024).
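To put the ablation numbers in proportion, the short calculation below expresses each variant's cost relative to the fully 3D backbone. The FLOPs and PSNR figures are the ones quoted above; the dictionary keys are illustrative labels, not names from the paper.

```python
# FLOPs in units of 10^12, as quoted in the ablation.
flops_t = {
    "full_3d": 16.98,
    "decoupled_2d_1d": 7.17,
    "decoupled_with_alphablender": 10.35,
}

# Cost of each variant as a fraction of the fully 3D backbone.
relative = {name: value / flops_t["full_3d"] for name, value in flops_t.items()}

# PSNR gain attributed to adding the AlphaBlender (dB).
psnr_gain = 29.64 - 29.36

for name, frac in relative.items():
    print(f"{name}: {frac:.0%} of full-3D FLOPs")
print(f"AlphaBlender PSNR gain: {psnr_gain:.2f} dB")
```

Under these figures the decoupled design costs roughly 42% of the fully 3D backbone, and roughly 61% once the AlphaBlender is added.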

UniWeTok adopts a hybrid model: a convolutional stem (stacked residual 3×3 convs, stride-2 downsampling with parallelized channel expansion) for local feature extraction, followed by a stack of Vision Transformer blocks (LayerNorm, MHSA, MLP) for global semantic modeling. The decoder mirrors this architecture in reverse order, employing transposed convolutions for upsampling (Zhuang et al., 15 Feb 2026).
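One way to see why a convolutional stem precedes the Transformer blocks is to count the tokens the attention layers would have to process. The sketch below is only arithmetic: the number of stride-2 stages (4) and the 256×256 input are hypothetical values chosen for illustration, not figures reported for UniWeTok.

```python
def token_grid(height, width, num_stride2_stages):
    """Spatial grid left after repeated stride-2 downsampling."""
    factor = 2 ** num_stride2_stages
    if height % factor or width % factor:
        raise ValueError("input size must be divisible by the total stride")
    return height // factor, width // factor


# Hypothetical setting: 256x256 input, four stride-2 stages in the stem.
h, w = token_grid(256, 256, num_stride2_stages=4)
num_tokens = h * w

# Self-attention compares every token with every other token.
pairs_after_stem = num_tokens ** 2
pairs_at_pixel_level = (256 * 256) ** 2
```

With these assumed values the stem reduces the sequence from 65,536 pixel positions to 256 tokens before any attention is computed.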

2. Quantization and Discretization Mechanisms

Efficient tokenization often demands discretization to bound vocabulary size and facilitate downstream generative modeling. Recent convolutional tokenizers advance beyond conventional vector quantization (VQ-VAE) via two approaches:

  • Finite Scalar Quantization (FSQ) (VidTok): Independently quantizes each latent channel to the nearest scalar in a fixed uniform grid $v_1 < \ldots < v_L$, yielding an implicit codebook of size $K = L^d$ for $d$ channels, without explicit learned embeddings. This enables near-100% codebook utilization and stable training both with and without commitment losses (Table 4 in (Tang et al., 2024)).
  • Binary Sign Quantization with Bounded Activations (UniWeTok): Features are grouped, passed through a SigLu activation (bounded to $[-1, 1]$), and discretized by sign, so each latent dimension contributes one bit. This defines a massive codebook ($2^d$ codes per token for $d$ binary dimensions), implemented lookup-free. The SigLu activation resolves the tension between the token entropy loss (promoting diversity) and the commitment loss (enforcing quantization tightness), making the latter superfluous in final training (Zhuang et al., 15 Feb 2026).
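The two quantizers can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than either paper's code: the FSQ grid is assumed to span $[-1, 1]$, inputs to the sign quantizer are assumed to be already bounded (the SigLu activation itself is not modeled), zero is mapped to the positive bit by convention, and the straight-through gradient used in training is omitted.

```python
def fsq_quantize(z, levels):
    """Snap each channel to the nearest of `levels` uniform grid points in [-1, 1]."""
    step = 2.0 / (levels - 1)
    indices = []
    for v in z:
        clipped = max(-1.0, min(1.0, v))
        indices.append(int(round((clipped + 1.0) / step)))
    values = [-1.0 + i * step for i in indices]
    return values, indices


def fsq_code(indices, levels):
    """Mixed-radix index into the implicit codebook of size levels ** d."""
    code = 0
    for i in indices:
        code = code * levels + i
    return code


def sign_quantize(z):
    """Binary sign quantization: one bit per dimension, no codebook lookup."""
    bits = [1 if v >= 0 else 0 for v in z]  # assumed convention: 0.0 maps to +1
    values = [1.0 if b else -1.0 for b in bits]
    code = 0
    for b in bits:
        code = (code << 1) | b
    return values, code


fsq_values, fsq_indices = fsq_quantize([-0.9, 0.1, 0.6], levels=5)
fsq_index = fsq_code(fsq_indices, levels=5)   # one of 5 ** 3 = 125 codes
sign_values, sign_index = sign_quantize([0.3, -0.7, 0.0, -0.1])  # one of 2 ** 4 codes
```

In both cases the token index is computed directly from the quantized values, so no embedding table has to be searched or stored.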

Losses for discretized tokenization typically include terms for reconstruction, perceptual similarity (LPIPS), adversarial discrimination, entropy regularization, and, when relevant, commitment or alignment to pre-trained semantic features (as in UniWeTok’s Pre-Post Distillation).
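A common way to combine such terms is a weighted sum. The form below is a generic sketch: the symbols and the weights $\lambda$ are placeholders introduced here for illustration, not the notation or values used by VidTok or UniWeTok, and $\mathcal{L}_{\text{aux}}$ stands for whichever commitment or distillation term is active.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{perc}}\,\mathcal{L}_{\text{LPIPS}}
  + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}
  + \lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}}
  + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}
```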

3. Training Schedules and Curriculum Design

To ensure computational efficiency, fidelity, and semantic generalization, convolutional tokenizers employ multi-stage curriculum training strategies:

VidTok utilizes two stages:

  • Stage 1 (pre-train joint encoder+decoder): Low-res (128×128), low frame rate (3 FPS), large dataset (~10M videos), 50k steps.
  • Stage 2 (fine-tune decoder, freeze encoder): Higher res (256×256), same frame rate, different dataset (~6M high-res videos), 30k steps.

This schedule halves GPU-hours (from ~3072 to 1536) with no measurable degradation (PSNR 29.21 vs. 29.19). Empirically, reduced training frame rates also favor dynamic fidelity by encouraging the latent to model larger inter-frame motions (Tang et al., 2024).
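The two-stage schedule can be written down as a small configuration object. The values come from the description above; the class, field, and stage names are hypothetical and do not correspond to VidTok's actual configuration files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str            # hypothetical label
    resolution: int      # square training resolution in pixels
    fps: int             # training frame rate
    steps: int           # optimizer steps
    train_encoder: bool
    train_decoder: bool


VIDTOK_STAGES = [
    Stage("pretrain", resolution=128, fps=3, steps=50_000,
          train_encoder=True, train_decoder=True),
    Stage("finetune_decoder", resolution=256, fps=3, steps=30_000,
          train_encoder=False, train_decoder=True),
]


def trainable_modules(stage):
    """Names of the modules that receive gradient updates in this stage."""
    modules = []
    if stage.train_encoder:
        modules.append("encoder")
    if stage.train_decoder:
        modules.append("decoder")
    return modules


# GPU-hour figures quoted in the text (approximate).
gpu_hours_ratio = 1536 / 3072
```

Freezing the encoder in the second stage means the latent space learned at low resolution stays fixed while only the decoder adapts to higher-resolution data.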

UniWeTok employs a three-phase curriculum:

  • Stage 1: General-domain, single resolution (256×256), combined base losses and alignment/distillation objectives.
  • Stage 2: Multi-resolution batches (including 128×128 and 512×512) to enforce variable-scale robustness.
  • Stage 3: Fine-tuning on content-sensitive datasets (faces, text, logos), enhancing sharpness and OCR performance.

Ablations confirm that each stage incrementally contributes to improved robustness and preserves performance across domains (Zhuang et al., 15 Feb 2026).
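Multi-resolution training as in Stage 2 is often implemented by drawing one resolution per batch, so that all images in a batch share a shape. The sketch below assumes that approach; the uniform sampling, the inclusion of 256 alongside 128 and 512, and the function name are assumptions, not details taken from UniWeTok.

```python
import random

# 128 and 512 are named in the text; including 256 is an assumption.
RESOLUTIONS = (128, 256, 512)


def resolution_schedule(num_batches, seed=0):
    """Pick one square resolution per batch (uniformly, by assumption)."""
    rng = random.Random(seed)
    return [rng.choice(RESOLUTIONS) for _ in range(num_batches)]


schedule = resolution_schedule(8, seed=0)
for step, res in enumerate(schedule):
    print(f"batch {step}: resize inputs to {res}x{res}")
```

Seeding the generator keeps the resolution sequence reproducible across runs.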

4. Continuous vs. Discrete Latent Tokenization

Convolutional tokenizers may operate in either continuous or discrete modes, with distinct trade-offs:

  • Continuous: The encoder's output $Z$ is passed directly to the decoder. Regularization typically uses a KL divergence penalty on the latent distribution. This affords higher raw fidelity, as demonstrated in VidTok, where 16-channel latent codes reach PSNR = 35.04, SSIM = 0.942, LPIPS = 0.047, and FVD = 78.9.
  • Discrete: Latents are quantized (FSQ or binary sign), facilitating downstream autoregressive or transformer-based generation (bounded vocabulary, reduced error accumulation). With FSQ (codebook size 262K), VidTok attains PSNR = 29.82, SSIM = 0.867, LPIPS = 0.106, FVD = 160.1 (Tang et al., 2024). Discrete tokenization also supports efficient compression and linguistic modeling analogs.
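For the continuous mode, a typical KL regularizer is the closed-form divergence between a diagonal Gaussian posterior and a standard normal prior. The sketch below assumes that standard VAE setup (an encoder predicting a mean and log-variance per latent element); the exact form used in VidTok is not given here, and the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent elements."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, logvar)
    )


# A posterior that already matches the prior incurs no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting the mean by one unit (unit variance) costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

The penalty grows as the encoder's predicted distribution drifts from the prior, which keeps the continuous latent space compact without imposing a finite vocabulary.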

A plausible implication is that continuous tokenizers are best suited to direct compression or retrieval, while discrete tokenizers are necessary for next-token prediction in generative LLMs and transformer pipelines.

5. Empirical Performance and Comparative Analysis

Convolutional tokenizers have demonstrated competitive or superior performance across vision benchmarks:

| Model | Mode | PSNR↑ | SSIM↑ | LPIPS↓ | FVD↓ |
|---|---|---|---|---|---|
| VidTok (FSQ 32K) | Discrete | 29.16 | 0.854 | 0.117 | 196.9 |
| VidTok (FSQ 262K) | Discrete | 29.82 | 0.867 | 0.106 | 160.1 |
| VidTok (KL 4ch) | Continuous | 29.64 | 0.852 | 0.114 | 194.2 |
| VidTok (KL 16ch) | Continuous | 35.04 | 0.942 | 0.047 | 78.9 |
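Two quick conversions help when reading these figures: a codebook of size $K$ carries $\log_2 K$ bits per token, and a PSNR gap in dB corresponds to a multiplicative gap in mean squared error. The sketch below assumes that "32K" and "262K" denote $2^{15}$ and $2^{18}$ entries and that pixel values are normalized to a peak of 1; both are assumptions, not statements from the table.

```python
import math


def bits_per_token(codebook_size):
    """Information carried by one token drawn from a codebook of this size."""
    return math.log2(codebook_size)


def mse_from_psnr(psnr_db, peak=1.0):
    """Invert PSNR = 10 * log10(peak^2 / MSE)."""
    return peak ** 2 / 10 ** (psnr_db / 10)


bits_32k = bits_per_token(2 ** 15)    # assumed meaning of "32K"
bits_262k = bits_per_token(2 ** 18)   # assumed meaning of "262K"

# How much larger the reconstruction error of FSQ 262K is than KL 16ch.
mse_ratio = mse_from_psnr(29.82) / mse_from_psnr(35.04)
```

Under these assumptions the larger FSQ codebook adds 3 bits per token, and the 5.22 dB gap between the best discrete and best continuous rows corresponds to roughly a 3.3x difference in MSE.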

On ImageNet, UniWeTok-H achieves FID = 1.38 (vs. REPA 1.42) with far fewer training tokens (33B vs. 262B) and improved DPG-Bench and editing scores on general domains (Zhuang et al., 15 Feb 2026).

Ablations in UniWeTok reveal that (a) a CNN-only encoder excels at texture recovery (rFID = 1.75) but underperforms in semantic tasks (Top-1 = 11.69%), (b) a Transformer-only encoder improves semantics at the cost of reconstruction quality (Top-1 = 26.09%, rFID = 3.38), and (c) the hybrid architecture achieves the best results on both measures (rFID = 1.35, Top-1 = 35.41%) (Zhuang et al., 15 Feb 2026).

6. Applications and Broader Context

Convolutional tokenizers are foundational for:

  • Text-to-video and text-to-image generation: Discrete codes can be processed by transformer LLMs (e.g., VideoPoet) or diffusion models.
  • Video and image compression: FSQ (VidTok) with 16 bits/frame-pixel approaches HEVC rates with learned transforms, offering a learned alternative to handcrafted codecs (Tang et al., 2024).
  • Video and image understanding: Continuous latents support downstream tasks such as action classification, retrieval, and in-context querying (e.g., VideoChat).
  • Unified multimodal LLMs (MLLMs): Discrete binary tokenization (UniWeTok) enables a single representation supporting high-fidelity reconstruction, rich semantics, and efficient generation and editing, matching or exceeding continuous-token and specialist discrete-token baselines in empirical evaluation (Zhuang et al., 15 Feb 2026).

In contrast to transformer-based or axial self-attention tokenizers (e.g., VideoGPT, VideoPoet), convolutional tokenizers offer reduced FLOPs, lower memory cost, and improved scalability to high input resolutions or channel counts, providing both architectural and computational advantages.
