Proxy Token Synthesis in Diffusion Transformers
- Proxy token synthesis is a computational strategy that partitions latent feature maps into local windows to compute representative tokens for global semantic modeling.
- It employs global proxy self-attention followed by cross-attention injection and a texture complement module to reduce quadratic complexity while preserving fine details.
- Empirical evaluations demonstrate significant GFLOPs reductions and competitive generative performance for high-resolution images and long-sequence video tasks.
Proxy token synthesis is a computational strategy for transformer-based diffusion models that efficiently captures global semantic information while drastically reducing the cost associated with self-attention operations. Deployed in the Proxy-Tokenized Diffusion Transformer (PT-DiT), this mechanism partitions latent feature maps into local windows and computes representative tokens as window averages, facilitating a compact yet expressive global modeling scheme (Wang et al., 2024).
1. Mathematical Formulation and Workflow
Proxy-token synthesis begins with a high-dimensional latent feature map $z_s \in \mathbb{R}^{f \times h \times w \times D}$, with $f$ frames, spatial dimensions $h$ and $w$, and feature dimension $D$. The map is partitioned into non-overlapping windows of size $(p_t, p_h, p_w)$ in time, height, and width. The $i$-th window encompasses $p_t p_h p_w$ tokens.
The $i$-th proxy token, $p_i$, is defined as the window mean:

$$p_i = \frac{1}{p_t p_h p_w} \sum_{j \in \Omega_i} z_j,$$

where $\Omega_i$ denotes the index set of tokens in the $i$-th window. The aggregated proxies $P_a = \{p_i\}_{i=1}^{M} \in \mathbb{R}^{M \times D}$, with $M = \frac{f}{p_t} \cdot \frac{h}{p_h} \cdot \frac{w}{p_w}$, encode the average features for all windows.
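To make the window-averaging step concrete, here is a minimal PyTorch sketch (our illustration; function and variable names are assumptions, not the paper's code):

```python
import torch

def proxy_tokens(z: torch.Tensor, p_t: int, p_h: int, p_w: int) -> torch.Tensor:
    """Average non-overlapping (p_t, p_h, p_w) windows of z in R^{f×h×w×D},
    returning proxies P_a in R^{M×D} with M = (f/p_t)(h/p_h)(w/p_w)."""
    f, h, w, D = z.shape
    assert f % p_t == 0 and h % p_h == 0 and w % p_w == 0
    # Split each axis into (num_windows, window_size), then average per window.
    z = z.view(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    return z.mean(dim=(1, 3, 5)).reshape(-1, D)

z = torch.randn(4, 32, 32, 64)               # toy latent: 4 frames, 32x32, D=64
P_a = proxy_tokens(z, p_t=1, p_h=2, p_w=2)   # M = 4*16*16 = 1024 proxies
```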
Subsequent processing involves:
- Global proxy self-attention: Multi-head self-attention is applied to $P_a$, yielding global features $G = \mathrm{SA}(P_a)$.
- Global injection via cross-attention: The global semantics are broadcast to all latent tokens by cross-attention, $z_s \leftarrow \mathrm{CA}(z_s, G)$, with queries from the latent tokens and keys/values from $G$, injecting aggregated context back into the detailed latent grid.
- Texture Complement Module (TCM): Fine detail is restored using window self-attention (WSA) and shifted-window self-attention (SWSA) following the structure of the Swin Transformer.
The core PT-DiT block executes these components in the following pseudo-code:
```
Input: z_s ∈ R^{N×D}
1. Reshape z_s → R^{f×h×w×D}
2. P_a ← Averaging(z_s; window size = (p_t, p_h, p_w))   # P_a ∈ R^{M×D}
3. G   ← SA(P_a)                      # global proxy self-attention
4. z_s ← CA(z_s, G)                   # cross-attention injection
5. Reshape z_s → R^{(N/(p_t·p_h·p_w)) × (p_t·p_h·p_w) × D}
6. ẑ   ← WSA(z_s) + z_s               # window attention
7. z_w ← SWSA(ẑ) + ẑ                  # shifted-window attention
8. Reshape z_w → R^{N×D} → Text-CrossAttn → MLP → output
```
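For readers who prefer executable code, the following self-contained PyTorch sketch reconstructs steps 1-7 of the block above (the text cross-attention and MLP tail of step 8 is elided). It reuses the `proxy_tokens` helper from Section 1; the module structure and hyperparameters are our assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

def window_partition(z, p_t, p_h, p_w):
    """(f, h, w, D) grid -> (num_windows, p_t*p_h*p_w, D) window batches."""
    f, h, w, D = z.shape
    z = z.view(f // p_t, p_t, h // p_h, p_h, w // p_w, p_w, D)
    return z.permute(0, 2, 4, 1, 3, 5, 6).contiguous().view(-1, p_t * p_h * p_w, D)

def window_unpartition(zw, f, h, w, p_t, p_h, p_w):
    """Inverse of window_partition, back to the (f, h, w, D) grid."""
    D = zw.shape[-1]
    zw = zw.view(f // p_t, h // p_h, w // p_w, p_t, p_h, p_w, D)
    return zw.permute(0, 3, 1, 4, 2, 5, 6).contiguous().view(f, h, w, D)

class PTDiTBlock(nn.Module):
    def __init__(self, dim=64, heads=4, p_t=1, p_h=2, p_w=2):
        super().__init__()
        self.p = (p_t, p_h, p_w)
        make = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proxy_sa, self.inject_ca = make(), make()
        self.wsa, self.swsa = make(), make()

    def forward(self, z):                                # z: (f, h, w, D)
        f, h, w, D = z.shape
        p_t, p_h, p_w = self.p
        # Steps 2-3: proxy synthesis + global proxy self-attention.
        P = proxy_tokens(z, p_t, p_h, p_w).unsqueeze(0)  # (1, M, D)
        G, _ = self.proxy_sa(P, P, P)
        # Step 4: cross-attention injection (queries: all N latents; KV: G).
        zs = z.reshape(1, -1, D)
        zs = zs + self.inject_ca(zs, G, G)[0]
        z = zs.view(f, h, w, D)
        # Steps 5-6: window self-attention with residual connection.
        zw = window_partition(z, p_t, p_h, p_w)
        zw = zw + self.wsa(zw, zw, zw)[0]
        z = window_unpartition(zw, f, h, w, p_t, p_h, p_w)
        # Step 7: shifted-window self-attention (roll, attend, roll back).
        s = [-(p_t // 2), -(p_h // 2), -(p_w // 2)]
        z = torch.roll(z, shifts=s, dims=(0, 1, 2))
        zw = window_partition(z, p_t, p_h, p_w)
        zw = zw + self.swsa(zw, zw, zw)[0]
        z = window_unpartition(zw, f, h, w, p_t, p_h, p_w)
        return torch.roll(z, shifts=[-v for v in s], dims=(0, 1, 2))

out = PTDiTBlock()(torch.randn(4, 32, 32, 64))           # toy forward pass
```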
2. Compression Ratios and Scaling Behavior
The proxy-token mechanism provides a parameterizable reduction in token count:
- The token count shrinks from $N = fhw$ latent tokens to $M = N/(p_t p_h p_w)$ proxies, a compression factor of $p_t p_h p_w$.
- For images, $p_t = 1$, so the compression is purely spatial: $M = N/(p_h p_w)$.
- Higher resolutions (512, 1024, 2048) use progressively larger $(p_h, p_w)$, so the number of proxies grows far more slowly than the number of latent tokens.
Video processing typically fixes the temporal window $p_t$ and proportionally adjusts spatial compression, allowing scale-adaptive resource allocation as resolution increases (Wang et al., 2024).
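As a back-of-envelope check of these ratios, the snippet below prints proxy counts for a toy video latent; the window sizes are illustrative assumptions, not the paper's published settings:

```python
f, h, w = 16, 64, 64                     # toy latent: frames and spatial size
N = f * h * w                            # 65,536 latent tokens
for p_t, p_h, p_w in [(1, 2, 2), (1, 4, 4), (4, 4, 4)]:
    M = N // (p_t * p_h * p_w)           # proxies after window averaging
    print(f"window ({p_t},{p_h},{p_w}): {N} tokens -> {M} proxies "
          f"({N // M}x fewer)")
```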
3. Computational Complexity Analysis
Proxy-token synthesis achieves significant cost savings by reducing the quadratic self-attention dependency:
- Standard global self-attention: $\mathcal{O}(N^2 D)$ over all $N$ latent tokens.
- PT-DiT with proxy tokens and TCM: $\mathcal{O}(M^2 D + N M D + N\, p_t p_h p_w\, D)$, covering proxy self-attention, cross-attention injection, and the (shifted-)window attention, with $M = N/(p_t p_h p_w) \ll N$.
- At image sizes $256/512/1024/2048$, the proxy blocks contribute a progressively smaller relative cost versus full self-attention as resolution grows.
Empirically, GFLOPs are reduced substantially at ImageNet scale (vs. DiT) and against PixArt-family models on $2048$-resolution images. For video generation, PT-DiT/H uses only a fraction of the GFLOPs of CogVideoX-2B or EasyAnimateV4, with lower memory overhead as the frame count grows (Wang et al., 2024). This suggests the mechanism is especially effective for very high-resolution and long-sequence settings.
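The formulas above can be turned into a rough cost comparison; the sketch below counts only the QKᵀ and attention-value multiply-accumulates (projections and MLPs ignored), with illustrative sizes of our choosing:

```python
def full_attn_cost(N, D):
    return 2 * N * N * D                 # O(N^2 D): global self-attention

def pt_dit_cost(N, D, p):                # p = p_t * p_h * p_w
    M = N // p                           # number of proxy tokens
    proxy = 2 * M * M * D                # proxy self-attention
    inject = 2 * N * M * D               # cross-attention injection
    tcm = 2 * 2 * N * p * D              # WSA + SWSA over size-p windows
    return proxy + inject + tcm

N, D, p = 64 * 64, 1152, 16              # toy 64x64 latent grid
print(full_attn_cost(N, D) / pt_dit_cost(N, D, p))   # ~13x cheaper attention
```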
4. Ablation Studies and Empirical Evaluation
Ablation experiments (reported as FID-50K, with the full model at $19.30$) show that each component of proxy-token synthesis contributes measurably to generative performance:
- Without the Global Injection and Interaction Module (GIIM): FID-50K degrades noticeably.
- Without the Texture Complement Module (TCM): $69.07$
- Dropping SWSA: $23.59$
Comparison of proxy extraction methods (sketched in code after the lists below):
- Window averaging (canonical): $19.30$
- Top-Left token only: $20.84$
- Random token: $21.00$
Global injection schemes:
- Cross-attention (canonical): $19.30$
- Linear projection: $20.24$
- Spatial interpolation: $21.82$
Compression ratio at $256$:
- Canonical setting: $19.30$
- Alternative compression settings: $21.24$, $20.43$
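To make the compared proxy-extraction variants concrete, here is a toy sketch of the three strategies applied to a single window (our illustration, not the evaluation code):

```python
import torch

window = torch.randn(8, 64)              # one window: 8 tokens, D = 64
avg_proxy = window.mean(dim=0)           # canonical: window averaging
topleft_proxy = window[0]                # top-left token only
rand_proxy = window[torch.randint(len(window), (1,)).item()]  # random token
```

Averaging is the only variant that aggregates information from every token in the window, which is consistent with its lower FID in the ablations above.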
On zero-shot MS-COCO FID-30K at $256$ resolution, Qihoo-T2I is competitive with state-of-the-art approaches at substantially lower computational expense. Text-to-video benchmarks (UCF-101 / MSR-VTT) confirm parity or superiority over previous DiT-based architectures (Wang et al., 2024).
5. Mechanistic Rationale and Semantic Function
The proxy-token synthesis paradigm is predicated on the empirical observation that attention maps in local spatial windows are highly correlated, and global modeling need not consume all tokens. Averaged window representations minimize redundancy, enabling self-attention among a reduced set of proxies to efficiently encode global semantic context.
Cross-attention subsequently propagates this context to the full latent set, retaining expressivity without incurring quadratic costs. The window and shifted-window refinements (Texture Complement Module) enforce local detail and avoid discretization artifacts, such as grid patterns (Wang et al., 2024).
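A toy experiment illustrates the redundancy premise: for smoothly varying features, adjacent tokens produce nearly identical attention rows (an assumed setup, purely illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16, 32).cumsum(dim=0)    # smooth 16-token sequence, D = 32
attn = torch.softmax(x @ x.T / 32**0.5, dim=-1)         # (16, 16) attention map
sim = F.cosine_similarity(attn[:-1], attn[1:], dim=-1)  # adjacent row pairs
print(sim.mean())                        # close to 1.0 for smooth inputs
```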
A plausible implication is that proxy-token synthesis generalizes well to other domains where spatial or temporal redundancy is endemic, provided local window structure is preserved.
6. Comparison with Baseline and Related Architectures
Quantitative and qualitative analyses indicate that PT-DiT, leveraging proxy-token synthesis, matches or outperforms conventional global-attention-based DiTs in both image and video generation, while drastically reducing hardware requirements and training/inference complexity.
In direct comparisons:
- DiT: Baseline full global self-attention; quadratic cost.
- PixArt-family models: Competing transformers utilizing global attention; higher GFLOPs.
- PT-DiT: Proxy-tokenized global injection; substantial GFLOPs reductions for both images and video.
These results suggest the proxy-token synthesis strategy is robust and adaptable across a range of generative tasks, including text-to-image (T2I), text-to-video (T2V), and text-to-multi-view (T2MV) settings (Wang et al., 2024).